The trouble with triples#

Sat, 26 Mar 2016 16:29:15 +0000

The other day I had occasion to write

(defn triples-to-map [triples]
  (reduce (fn [m row]
            (update-in m (butlast row)
                       (fn [old new] (if old (conj old new) [new]))
                       (last row)))
          {}
          triples))

and be surprised and delighted that it ran first time with the expected result. As witness:

foo.search=> (clojure.pprint/pprint triples_)
([:bnb:016691109 :published "2014"]
 [:bnb:016691109 :title "The Seven Streets of Liverpool"]
 [:bnb:016691109 :publisher "Orion"]
 [:bnb:016691109 :schema :shlv:Book]
 [:bnb:016691109 :author "Lee, Maureen"]
 [:bnb:016594932 :published "2013"]
 [:bnb:016594932 :title "Stephen Guy's forgotten Liverpool"]
 [:bnb:016594932 :publisher "Trinity Mirror"]
 [:bnb:016594932 :schema :shlv:Book]
 [:bnb:016594932 :author "Guy, Stephen"]
 [:bnb:016242841 :published "2012"]
 [:bnb:016242841
  :title
  "Robbed : my Liverpool life : the Rob Jones story"]
 [:bnb:016242841 :publisher "Kids Academy Publishing"]
 [:bnb:016242841 :schema :shlv:Book]
 [:bnb:016242841 :author "Jones, Rob, 1971-"]
 [:bnb:016744037 :published "2012"]
 [:bnb:016744037 :title "Steven Gerrard : my Liverpool story"]
 [:bnb:016744037 :publisher "Headline"]
 [:bnb:016744037 :schema :shlv:Book]
 [:bnb:016744037 :author "Gerrard, Steven, 1980-"])
foo.search=> (clojure.pprint/pprint (triples-to-map triples_))
{:bnb:016691109
 {:published ["2014"],
  :title ["The Seven Streets of Liverpool"],
  :publisher ["Orion"],
  :schema [:shlv:Book],
  :author ["Lee, Maureen"]},
 :bnb:016594932
 {:published ["2013"],
  :title ["Stephen Guy's forgotten Liverpool"],
  :publisher ["Trinity Mirror"],
  :schema [:shlv:Book],
  :author ["Guy, Stephen"]},
 :bnb:016242841
 {:published ["2012"],
  :title ["Robbed : my Liverpool life : the Rob Jones story"],
  :publisher ["Kids Academy Publishing"],
  :schema [:shlv:Book],
  :author ["Jones, Rob, 1971-"]},
 :bnb:016744037
 {:published ["2012"],
  :title ["Steven Gerrard : my Liverpool story"],
  :publisher ["Headline"],
  :schema [:shlv:Book],
  :author ["Gerrard, Steven, 1980-"]}}
nil

(Now I write that code down for the second time I wonder whether using update-in is slightly overkill when I know the map will only ever be two levels deep. But that's not something I'm interested in right now.)

What I'm interested in right now is that the input list for this function is itself the output of some other code which - mostly thanks to Instaparse - was unexpectedly easy to write. I've been playing around lately with RDF and the Semantic Web, and needed a way of parsing N-Triples - which looks superficially simple enough that Awk could do it, until you start thinking about comments and strings with spaces in them and escaped special characters and ...

Anyway, Instaparse steps in to save the day again. I believe I have written previously to give my opinion that Instaparse is awesome and I will go on record to say that this fresh experience merely serves to cement my first impression.

N-Triples has a published EBNF grammar . I had to monkey with this a bit to get it into Instaparse

instaparse doesn't understand "character classes", so productions of the form [0-9] had to be rewritten in a regex form #"[0-9]"

there was nothing in there about whitespace or comments. I added the WS production which allows both, and scattered it into appropriate-looking places

to simplify the code that walks the parse tree, I added a production for IRI so I could make instaparse strip off the < and > around IRI references for me. STRING_LITERAL is likewise my creation

#xnn notation for hex-encoded characters wasn't understood, so I swapped for \u or \x{nn}

Here's the final result

ntriplesDoc 	::= line*
line ::= WS* triple? EOL
triple 	::= 	subject WS* predicate WS* object WS* '.' WS*
subject 	::= 	IRIREF | BLANK_NODE_LABEL
predicate 	::= 	IRIREF
object 	::= 	IRIREF | BLANK_NODE_LABEL | literal
literal 	::= 	STRING_LITERAL_QUOTED ('^^' IRIREF | LANGTAG)?
LANGTAG 	::= 	'@' #"[a-zA-Z]"+ ('-' #"[a-zA-Z0-9]"+)*
EOL 	::= 	#"[\n\r]"+ 
WS 	::= 	#"[ \t]" | #"#.*"
IRIREF 	::= 	'<' IRI '>'
IRI ::= (#"[^\u0000-\u0020<>\"{}|^`\\]" | UCHAR)*
STRING_LITERAL_QUOTED 	::= 	'"' STRING_LITERAL  '"'
STRING_LITERAL ::= ( #"[^\u0022\u005C\u000A\u000D]" | ECHAR | UCHAR)*
BLANK_NODE_LABEL 	::= 	'_:' (PN_CHARS_U | #"[0-9]") ((PN_CHARS | '.')* PN_CHARS)?
UCHAR 	::= 	'\\u' HEX HEX HEX HEX | '\\U' HEX HEX HEX HEX HEX HEX HEX HEX
ECHAR ::= "\\" #"[tbnrf\"\'\\]"
HEX ::= #"[0-9A-Fa-f]"
PN_CHARS_BASE 	::= 	#"[A-Z]" | #"[a-z]" | #"[\u00C0-\u00D6]" |
                        #"[\u00D8-\u00F6]"  | #"[\u00F8-\u02FF]" | 
                        #"[\u0370-\u037D]"  | #"[\u037F-\u1FFF]" | 
                        #"[\u200C-\u200D]"  | #"[\u2070-\u218F]" |
                        #"[\u2C00-\u2FEF]"  | #"[\u3001-\uD7FF]" | 
                        #"[\uF900-\uFDCF]"  | #"[\uFDF0-\uFFFD]" | 
                        #"[\x{10000}-\x{EFFFF}]" 
PN_CHARS_U ::= PN_CHARS_BASE | ":" | "_" 
PN_CHARS 	::= 	PN_CHARS_U | "-" | #"[0-9]" | "\u00B7" | 
                        #"[\u0300-\u036F]" | #"[\u203F-\u2040]"

Calling insta/parse with this grammar on a sample line gets you something looking like

[:ntriplesDoc
 [:line
  [:triple
   [:subject
    [:IRIREF "<" [:IRI  "h" "t" "t" "p" ":" "/" "/" "b" "n" "b" "."
                        "d" "a" "t" "a" "."  "b" "l" "."  "u" "k" "/" "i" "d"
                        "/" "r" "e" "s" "o" "u" "r" "c" "e" "/" "0" "1" "6" "7"
                        "0" "6" "8" "5" "5"] ">"]]
   [:WS " "]
   [:predicate
    [:IRIREF "<" [:IRI "h" "t" "t" "p" ":" "/" "/" "l" "o" "c" "a" "l"
                       "h" "o" "s" "t" ":" "3" "0" "3" "0" "/" "p" "u"
                       "b" "l" "i" "s" "h" "e" "d"] ">"]]
   [:WS " "] [:object [:literal [:STRING_LITERAL_QUOTED
      "\"" [:STRING_LITERAL "2" "0" "1" "4"] "\""]]] [:WS " "]
   "."]
  [:EOL "\n"]]]

which clearly is going to need some more attention before it's usable. We do this in two passes: first we visit the entire tree node-by-node to do things like turn literal node values into strings and IRI nodes into URI objects.

(defn visit-node [branch]
  (if (vector? branch)
    (case (first branch)
      :IRIREF
      (let [[_< [_iri_tok & letters] _>] (rest branch)
            iri (str/join letters)]
        (or (prefixize iri)
            (URI. iri)))
      :STRING_LITERAL (str/join (rest branch))
      :STRING_LITERAL_QUOTED (let [[_ string _] (rest branch)] string)
      :literal (second branch)
      :WS ""
      :UCHAR (let [[_ & hexs] (rest branch)]
               (String.
                (Character/toChars
                 (Integer/parseInt (str/join (map second hexs)) 16))))
      :triple (let [m (reduce (fn [m [k v]] (assoc m k v)) {}
                              (rest branch))]
                [:triple [(:subject m) (:predicate m) (:object m)]])
      branch)
    branch))

Then we transform the tree into a seq and filter the seq to get only the :triple nodes. Putting it all together:

(defn parse-n-triples [in-string]
  (->> in-string
       (insta/parse n-triple-parser)
       (walk/postwalk visit-node)
       (tree-seq #(and (vector? %)
                       (keyword? (first %))
                       (not (= (first %) :triple)))
                 #(rest %))
       (filter #(= (first %) :triple))
       (map second)))

I'm reasonably confident that the grammar is correct: I pushed all the official N-Triples Test Suite through it without error. My post-parsing massage passes, though, are possibly not correct and certainly not complete, which is one reason I'm just blogging about it instead of publishing it as a standalone library somewhere. Things I already know it doesn't do: blank node support, language tags, datatypes, escaped characters. Things I don't know it doesn't do: don't know. But it seems to work for my use case - of which, more later.