Pure Danger Tech


Using Clojure and clj-plaza to play with RDF data

24 Jun 2010

Yesterday I gave a demo of the clj-plaza Clojure library for working with RDF data at the Jena User’s Group meeting during Semtech. This demo was REPL-only so I don’t have any slides to post but I thought perhaps an annotated version of the script I used would be useful for someone. Many thanks to Antonio Garrote for writing the library.

For the purposes of the demo, I used NetBeans and Enclojure and defined my classpath using a Maven pom. I won’t go into details on it as it’s pretty straightforward (but still >100 lines long). It pulls in the plaza dependencies and uses the Maven Clojure plugin. Once I opened this project in NetBeans, I just started a REPL from the project. You can find the pom here if you want it.

Clojure

Because many people in the audience were not familiar with Clojure, I did a short intro to Clojure itself. Clojure is a dynamically-typed Lisp dialect that runs on the JVM. I was drawn to it because of:

  1. Lisp – excellent capabilities for abstraction, flexibility, and functional programming
  2. JVM – Clojure leverages JVM features such as garbage collection, dynamic performance optimization, and a portable runtime, and it embraces Java interoperability, which gives access to the vast wealth of portable Java libs.
  3. concurrency – Clojure starts from a base of immutable persistent data structures and builds a managed way to provide identity over time pointing to an evolving snapshot of immutable data. Changes are made through explicit state-changing functions that occur in the context of a software transactional memory system. Reads can be done at any time by just getting the current snapshot.
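
A minimal sketch of the concurrency point, using Clojure's refs and dosync. The account names and amounts are made up for illustration:

```clojure
;; Each ref is an identity that points at an immutable value.
(def account-a (ref 100))
(def account-b (ref 0))

(defn transfer!
  "Moves amount from one ref to the other inside a single transaction."
  [from to amount]
  (dosync
    (alter from - amount)
    (alter to + amount)))

(transfer! account-a account-b 30)

;; Reads need no transaction: deref (@) returns the current snapshot.
[@account-a @account-b] ;; => [70 30]
```

Both alter calls commit together or not at all, so a reader never sees the money in neither account.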

This demo will just work from the REPL starting with some basic Clojure syntax. The REPL is kind of like the shell you might use in Ruby or Groovy or whatever other non-Java language you use except that in Clojure, the REPL is much closer to the heart of what Clojure (or any Lisp) is. REPL = read-eval-print loop. The reader reads and creates Clojure data structures, eval evaluates those data structures, and print can output Clojure data structures.

Clojure uses prefix notation that starts with a function and is followed by its arguments:

user=> (+ 2 2)
4
user=> (+ 1 (* 3 5))
16
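
One consequence of prefix notation is that the function comes first no matter how many arguments follow. A few more forms to try, as a small sketch:

```clojure
;; Arithmetic and comparison functions accept any number of arguments.
(+ 1 2 3 4)            ;; => 10
(< 1 2 3)              ;; => true, the arguments are in increasing order
(* 2 (+ 1 2) (- 10 4)) ;; => 36, nested forms are evaluated first
```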

You can define variables in your current namespace:

user=> (def v 5)
#'user/v
user=> v
5
user=> (+ v v)
10

Lisps are inherently dependent on the list data structure; you can think of a list as a linked list where new items are added to the head. A list is represented by the ubiquitous parentheses: (1 3 5). You can try typing a list at the REPL but it won’t work because Lisp wants to read that list *and evaluate it*, treating the first item as a function. However, you can use other functions to explicitly create lists:

user=> (1 3 5)
#<CompilerException java.lang.ClassCastException: java.lang.Integer cannot be cast to clojure.lang.IFn (NO_SOURCE_FILE:2)>
user=> (list 1 3 5)
(1 3 5)
user=> (quote (1 3 5))
(1 3 5)
user=> '(1 3 5)
(1 3 5)

A $5 word you might hear is “homoiconic” by which people mean that Clojure *code* is represented in terms of Clojure *data structures*. This is in opposition to most languages people use today where code is represented with a bunch of syntax understood via an abstract syntax tree. This allows us to generate code as data and execute it:

user=> (def foo '(+ 5 5))
#'user/foo
user=> (eval foo)
10

This is deep and important and you should read something more useful than this blog to understand why. :)
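
To make "code as data" a little more concrete, here is a small sketch that takes a quoted expression apart with ordinary list functions and builds a new one. The names expr and product are just for this example:

```clojure
(def expr '(+ 5 5))

(first expr) ;; => +      (a symbol, not yet a function call)
(rest expr)  ;; => (5 5)

;; Build a new expression by swapping the operator.
(def product (cons '* (rest expr)))

product        ;; => (* 5 5)
(eval product) ;; => 25
```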

An extremely important part of Clojure is its set of core data structures, so we’ll take a brief look at a few of them. I’ve already mentioned lists and how to create them. You can also add things to them with conj (in which case they are pushed on the head), grab the first thing or the rest of the things in the list:

user=> (def a '(1 2))
#'user/a
user=> (conj a 3 4)
(4 3 1 2)
user=> (first a)
1
user=> (rest a)
(2)

Vectors are denoted by [ ] and differ from lists in that they are more like ArrayLists in Java and things append to the tail, not to the head. Because they are not eagerly evaluated as code, it is often more convenient (and common) to use vectors to build intermediate data structures that you pass around in Clojure.

user=> [1 2 3]
[1 2 3]
user=> (def v [1 2 3])
#'user/v
user=> (conj v 4)
[1 2 3 4]
user=> (first v)
1
user=> (rest v)
(2 3)
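
Note that conj never changes the original collection; it returns a new one. A short sketch of that, plus indexed access on vectors (v2 is a name made up here):

```clojure
(def v [1 2 3])
(def v2 (conj v 4))

v          ;; => [1 2 3]    the original vector is untouched
v2         ;; => [1 2 3 4]
(nth v 0)  ;; => 1
(v 2)      ;; => 3          a vector is also a function of its index
(count v2) ;; => 4
```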

Maps are denoted by { } and are kind of like HashMap in Java. The representation consists of “key value key value …”. If you like you can use commas to separate key-value pairs as commas are treated as whitespace in Clojure. The first and rest functions work over a sequence of key-value pairs from the map. In Clojure, maps *are* functions of the key that return the value so you can use the map as a function.

user=> {1 2 3 4}
{1 2, 3 4}
user=> (keys {1 2 3 4})
(1 3)
user=> (vals {1 2 3 4})
(2 4)
user=> (first {1 2 3 4})
[1 2]
user=> (rest {1 2 3 4})
([3 4])
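
The REPL session above does not show the "maps are functions" part, so here is a small sketch of it. The pairs and person maps are made up for illustration:

```clojure
(def pairs {1 2, 3 4})

(pairs 1)       ;; => 2      the map looks up its argument as a key
(pairs 5)       ;; => nil    missing keys give nil
(pairs 5 :none) ;; => :none  an optional second argument is the default

;; Keywords are functions too, which reads nicely with keyword keys.
(def person {:name "Elvis" :city "Memphis"})
(:name person)  ;; => "Elvis"
```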

There are also sets and queues if you need them.
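
For completeness, a quick sketch of both. Sets have a literal syntax; queues do not, so you start from the empty queue that ships with Clojure:

```clojure
(def s #{1 2 3})

(= s (conj s 3)) ;; => true  adding an existing element changes nothing
(contains? s 2)  ;; => true
(s 5)            ;; => nil   a set is a function that tests membership

;; A persistent FIFO queue: conj adds at the back, peek/pop work at the front.
(def q (conj clojure.lang.PersistentQueue/EMPTY 1 2 3))

(peek q)         ;; => 1
(seq (pop q))    ;; => (2 3)
```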

A key aspect of Clojure is functional programming and it has a rich set of functions for doing FP type stuff. fn is a special form to create a function which can be named and used. Most commonly, you’ll use the helpful defn macro to do this though.

user=> (def f (fn [a] (* a a)))
#'user/f
user=> (f 5)
25
user=> (defn f [a] (* a a))
#'user/f
user=> (f 5)
25
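
Because functions are ordinary values, a function can also build and return another function. A minimal sketch (make-multiplier and times10 are names invented for this example):

```clojure
(defn make-multiplier
  "Returns a function that multiplies its argument by n."
  [n]
  (fn [a] (* a n)))

(def times10 (make-multiplier 10))

(times10 5)             ;; => 50
((make-multiplier 3) 7) ;; => 21
```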

If we define a new function, we can call it just like any other function by passing it some arguments. Some classic FP functions we might want to call are map (which applies a function over a sequence), reduce (which reduces a sequence to a single result), and filter (which pulls matching elements out of a sequence based on a criteria function):

user=> (defn x10 [a] (* a 10))
#'user/x10
user=> (def r (range 10))
#'user/r
user=> r
(0 1 2 3 4 5 6 7 8 9)
user=> (map x10 r)
(0 10 20 30 40 50 60 70 80 90)
user=> (map * r r)
(0 1 4 9 16 25 36 49 64 81)
user=> (map #(* % %) r)
(0 1 4 9 16 25 36 49 64 81)
user=> (reduce + r)
45
user=> (reduce * r)
0
user=> (filter odd? r)
(1 3 5 7 9)

The #( … %) business is syntactic sugar for an anonymous function where % is the item being evaluated. You might also have caught that you can walk multiple sequences at the same time with map and apply the items to the function.
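
When the anonymous function takes more than one argument, the sugar uses %1, %2 and so on. A few sketches of that, alongside the equivalent fn form:

```clojure
;; %1 and %2 are the first and second arguments.
(map #(+ %1 %2) [1 2 3] [10 20 30])         ;; => (11 22 33)

;; The same thing written out with fn.
(map (fn [a b] (+ a b)) [1 2 3] [10 20 30]) ;; => (11 22 33)

;; map stops when the shortest sequence runs out.
(map + [1 2 3] [10 20])                     ;; => (11 22)

;; Sum of squares: %1 is the running total, %2 the next item.
(reduce #(+ %1 (* %2 %2)) 0 (range 4))      ;; => 14
```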

clj-plaza

Ok, so enough basics, let’s look at the clj-plaza library, written by Antonio Garrote. clj-plaza has a bunch of useful semantic web functionality, including I/O, creation and inspection of RDF data, and querying. Additionally there is support for creating a triple space (akin to the classic tuple space) and creating semantic RESTful services. I’m focusing just on the basics of working with RDF here.

To use plaza, we make the plaza namespace known to Clojure, and then import a file of Elvis impersonator data in RDF (thank you Internet) into a model:

user=> (use 'plaza.rdf.core)
nil
user=> (def e (document-to-model "http://www.snee.com/rdf/elvisimp.rdf" :xml))
#'user/e

If you wanted to load from a file instead, you might do this:

user=> (import java.io.FileInputStream)
java.io.FileInputStream
user=> (def e (document-to-model (new FileInputStream "data/elvisimp.rdf") :xml))
#'user/e

In both cases here we end up with a plaza model, which is actually just a Clojure agent protecting access to a Jena model:

user=> (class e)
clojure.lang.Agent
user=> (class @e)
com.hp.hpl.jena.rdf.model.impl.ModelCom
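
To get a feel for what holding a value in an agent means, here is a sketch in plain Clojure with a map standing in for the Jena model. The :triples key is made up, and none of this is plaza API:

```clojure
;; A stand-in for the model: an agent wrapping an ordinary map.
(def a (agent {:triples 0}))

(class a) ;; => clojure.lang.Agent
@a        ;; => {:triples 0}

;; Changes are sent to the agent and applied asynchronously, one at a time.
(send a update-in [:triples] inc)
(await a) ;; block until the action sent above has run

@a        ;; => {:triples 1}

;; Only needed in a standalone script so the JVM can exit; skip it at the REPL.
(shutdown-agents)
```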

If you’re hearing the hype about RDFa, you might be interested in scraping some RDF data out of web pages. Jena supports this already and plaza makes it really trivial to scrape the web for the data. Here I scrape a slideshare.net page for one of my presentations:

user=> (reset-model)
#<Agent #<ModelCom <ModelCom   {} | >>>
user=> (def rdfa (document-to-model
         "http://www.slideshare.net/alexmiller/java-concurrency-gotchas-3666977"
         :html))
#'user/rdfa

If you then want to save it somewhere you can convert the model to a string in N3 format and dump it to a file with the Clojure core function spit:

user=> (def rdfa-str (with-out-str (model-to-format rdfa :n3)))
#'user/rdfa-str
user=> (print rdfa-str)
@prefix dc:      <http://purl.org/dc/terms/> .
@prefix hx:      <http://purl.org/NET/hinclude> .
@prefix media:   <http://search.yahoo.com/searchmonkey/media/> .
@prefix og:      <http://opengraphprotocol.org/schema/> .
@prefix fb:      <http://developers.facebook.com/schema/> .

<http://www.slideshare.net/alexmiller/java-concurrency-gotchas-3666977>
      fb:app_id "2490221586"@en ;
      og:image "http://cdn.slidesharecdn.com/concurrencygotchas-100408105435-phpapp01-thumbnail-2?1270742095"@en ;
      og:site_name "SlideShare"@en ;
      og:title "Java Concurrency Gotchas"@en ;
      og:type "article"@en ;
      og:url  "http://www.slideshare.net/alexmiller/java-concurrency-gotchas-3666977"@en ;
      dc:creator "Alex Miller"@en ;
      dc:description "Common Java concurrency problems and how to fix them."@en ;
      media:height "355"@en ;
      media:presentation <http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=concurrencygotchas-100408105435-phpapp01&stripped_title=java-concurrency-gotchas-3666977> ;
      media:thumbnail <http://cdn.slidesharecdn.com/concurrencygotchas-100408105435-phpapp01-thumbnail?1270742095> ;
      media:title "Java Concurrency Gotchas"@en ;
      media:width "425"@en ;
      <http://www.w3.org/1999/xhtml/vocab#alternate>
              <http://www.slideshare.net/rss/latest> ;
      <http://www.w3.org/1999/xhtml/vocab#icon>
              <http://www.slideshare.net/favicon.ico> ;
      <http://www.w3.org/1999/xhtml/vocab#stylesheet>
              <http://public.slidesharecdn.com/v3/styles/combined.css?1277383862> .
nil
user=> (spit "/Users/alex/Desktop/foo.n3" rdfa-str)
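
spit has a counterpart, slurp, that reads a whole file back into a string. A small round-trip sketch using a temp file instead of a hard-coded path (the file prefix and the one-line N3 content are made up):

```clojure
(import java.io.File)

;; Create a temp file so the example does not depend on a particular home dir.
(def tmp (File/createTempFile "plaza-demo" ".n3"))
(def path (.getAbsolutePath tmp))

(spit path "@prefix ex: <http://example.org/> .\n")

;; Read the contents back as a string.
(def round-trip (slurp path))
round-trip ;; => "@prefix ex: <http://example.org/> .\n"

(.delete tmp)
```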

The clj-plaza lib also includes ways to easily create new RDF data, including resources, literals, and typed literals, from which we can create triples:

user=> (rdf-resource "http://example.org/foo")
#<ResourceImpl http://example.org/foo>
user=> (rdf-literal "abc")
#<LiteralImpl abc^^http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral>
user=> (rdf-literal "abc" "en")
#<LiteralImpl abc@en>
user=> (rdf-typed-literal 5)
#<LiteralImpl 5^^http://www.w3.org/2001/XMLSchema#int>
user=> (d 5)
#<LiteralImpl 5^^http://www.w3.org/2001/XMLSchema#int>

Here you can use a function l as a synonym for rdf-literal and d as a synonym for rdf-typed-literal.

We can make triples by passing make-triples a vector of triples, where each triple is itself a vector of subject, predicate, and object. We can also define namespaces in a namespace registry to make this a bit more readable and writeable. Once you’ve built some triples in Clojure data structures, you can easily drop those into a model if you want to work with them from there:

user=> (make-triples [["http://example.org/Alex" "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" "http://xmlns.com/foaf/0.1/Person"]])
[[#<ResourceImpl http://example.org/Alex>
  #<PropertyImpl http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  #<ResourceImpl http://xmlns.com/foaf/0.1/Person>]]
user=> (register-rdf-ns :ex "http://example.org/")
{"http://example.org/" :ex,
 "http://www.w3.org/2000/01/rdf-schema#" :rdfs,
 "http://www.w3.org/1999/02/22-rdf-syntax-ns#" :rdf}
user=> (register-rdf-ns :foaf "http://xmlns.com/foaf/0.1/")
{"http://xmlns.com/foaf/0.1/" :foaf,
 "http://example.org/" :ex,
 "http://www.w3.org/2000/01/rdf-schema#" :rdfs,
 "http://www.w3.org/1999/02/22-rdf-syntax-ns#" :rdf}
user=> (make-triples [[[:ex :Alex] [:rdf :type] [:foaf :Person]]])
[[#<ResourceImpl http://example.org/Alex>
  #<PropertyImpl http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  #<ResourceImpl http://xmlns.com/foaf/0.1/Person>]]
user=> (alter-root-rdf-ns "http://www.example.org/")
"http://www.example.org/"
user=> (def t (make-triples [[:Alex [:rdf :type] [:foaf :Person]]]))
#'user/t
user=> (def m (defmodel (model-add-triples t)))
#'user/m

There are also a bunch of helper functions for looking at triples and parts of triples and getting the information back out:

user=> (def et (model-to-triples e))
#'user/et
user=> (s (first t))
#<ResourceImpl http://www.example.org/Alex>
user=> (p (first t))
#<PropertyImpl http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
user=> (o (first t))
#<ResourceImpl http://xmlns.com/foaf/0.1/Person>
user=> (resource-uri (o (first t)))
"http://xmlns.com/foaf/0.1/Person"
user=> (literal-datatype-uri (d 5))
"http://www.w3.org/2001/XMLSchema#int"
user=> (literal-value (d 5))
5
user=> (literal-language (l "abc" "en"))
"en"

Since we can deal with triples in terms of basic Clojure data structures, we can then also apply all of the basic Clojure functions to them as well. For example we can grab all the predicates by just applying the plaza p function to every triple in the set (and take the first 10 here for simplicity):

user=> (take 10 (map #(p %) et))
(#<PropertyImpl http://purl.org/dc/elements/1.1/description>
 #<PropertyImpl http://purl.org/dc/elements/1.1/title>
 #<PropertyImpl http://purl.org/dc/elements/1.1/title>
 #<PropertyImpl http://purl.org/dc/elements/1.1/title>
 #<PropertyImpl http://purl.org/dc/elements/1.1/title>
 #<PropertyImpl http://purl.org/dc/elements/1.1/description>
 #<PropertyImpl http://purl.org/dc/elements/1.1/description>
 #<PropertyImpl http://www.snee.com/ns/epinfluences>
 #<PropertyImpl http://www.snee.com/ns/eppay-range>
 #<PropertyImpl http://www.snee.com/ns/epyear-established>)

Here we’re seeing the Jena PropertyImpl class wrapping those predicates but we can easily extract the uri from within the PropertyImpl too:

user=> (take 10 (map #(resource-uri (p %)) et))
("http://purl.org/dc/elements/1.1/description"
 "http://purl.org/dc/elements/1.1/title"
 "http://purl.org/dc/elements/1.1/title"
 "http://purl.org/dc/elements/1.1/title"
 "http://purl.org/dc/elements/1.1/title"
 "http://purl.org/dc/elements/1.1/description"
 "http://purl.org/dc/elements/1.1/description"
 "http://www.snee.com/ns/epinfluences"
 "http://www.snee.com/ns/eppay-range"
 "http://www.snee.com/ns/epyear-established")

We can see lots of duplicates here, so it’s easy to use the built-in Clojure functions distinct (for duplicate removal) and sort to clean up our list of predicates in the data set:

user=> (sort (distinct (map #(resource-uri (p %)) et)))
("http://purl.org/dc/elements/1.1/creator"
 "http://purl.org/dc/elements/1.1/description"
 "http://purl.org/dc/elements/1.1/rights"
 "http://purl.org/dc/elements/1.1/title"
 "http://www.snee.com/ns/epaudio-sample"
 "http://www.snee.com/ns/epavailable-for"
 "http://www.snee.com/ns/epcategory"
 "http://www.snee.com/ns/epinfluences"
 "http://www.snee.com/ns/eplocation"
 "http://www.snee.com/ns/epname"
 "http://www.snee.com/ns/eppay-range"
 "http://www.snee.com/ns/epvideo-sample"
 "http://www.snee.com/ns/epyear-established")

Plaza has a mechanism to easily create a simple or complex predicate for finding matching triples in a triple set. This is done using the triple-check function (there is also a shortcut version called tc). Inside triple-check there is a set of predicate functions that can be combined using and?, or?, etc. For example, we can create a predicate that searches for any triples that have an object literal that contains the word “impersonator” in our Elvis data:

user=> (filter (triple-check 
                 (object-and? (is-literal?)
                              (regex? #"impersonator"))) et)
([#<ResourceImpl http://www.all4funchgo.bizland.com>
  #<PropertyImpl http://purl.org/dc/elements/1.1/description>
  #<LiteralImpl providing Elvis impersonators for parties, singing telegrams, and corporate events serving the Chicago metro area.>]
...etc...
)

If we wanted we could take that output set of triples and further manipulate it either with predicates or built-in Clojure functionality.
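
To illustrate the "built-in Clojure functionality" half of that without needing plaza on the classpath, here is a sketch over stand-in triples. Keywords play the role of resources and strings the role of literals; the data and the object-literal-matching helper are invented for this example and are not plaza API:

```clojure
;; Stand-in triples: [subject predicate object].
(def triples
  [[:ex/all4fun :dc/description "providing Elvis impersonators for parties"]
   [:ex/all4fun :dc/title       "All 4 Fun"]
   [:ex/elvez   :dc/title       "El Vez"]
   [:ex/elvez   :ep/category    :ep/tribute]])

(defn object-literal-matching
  "Returns a predicate that is true for triples whose object is a string
  containing a match for re."
  [re]
  (fn [[_ _ o]]
    (boolean (and (string? o) (re-find re o)))))

;; Filter with the predicate, then keep working with plain sequence functions.
(def hits (filter (object-literal-matching #"impersonator") triples))
(map first hits) ;; => (:ex/all4fun)

;; Sorted, de-duplicated predicates across all triples.
(def predicates (sort (distinct (map second triples))))
predicates       ;; => (:dc/description :dc/title :ep/category)
```

The predicate is just a function, so it composes with filter, remove, some, and the rest of the sequence library.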

Plaza also has a mechanism to create patterns (basically SPARQL graph patterns) and filters (SPARQL filters) and ways to apply these patterns and filters directly to a model or a set of triples in vector form.

user=> (use 'plaza.rdf.sparql)
nil
user=> (def elvez (make-pattern [[:?s :?p (d "El Vez")]]))
#'user/elvez
user=> (pattern-apply et elvez)
([[#<ResourceImpl http://members.aol.com/elvezco>
   #<ResourceImpl http://purl.org/dc/elements/1.1/title>
   #<LiteralImpl El Vez^^http://www.w3.org/2001/XMLSchema#string>]])

You can also use plaza syntax to create full SPARQL queries. You can dump those as a string or apply them to either a model or a set of triple vectors.

user=> (def q (defquery
         (query-set-type :select)
         (query-set-vars [:?s :?p])
         (query-set-pattern elvez)))
#'user/q
user=> (query-to-string q)
"SELECT  ?s ?p\nWHERE\n  { ?s  ?p  \"El Vez\"^^<http://www.w3.org/2001/XMLSchema#string> . }\n"
user=> (model-query e q)
({:?p
  #<ResourceImpl http://purl.org/dc/elements/1.1/title>,
  :?s
  #<ResourceImpl http://members.aol.com/elvezco>})

Note that the results here are a sequence of maps where each is keyed by selected variables.
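
Since each result is an ordinary map, the usual map and sequence functions apply. A sketch with plain strings standing in for the Jena resources (the results value below is hand-written to mirror the output above, not produced by plaza):

```clojure
(def results
  [{:?s "http://members.aol.com/elvezco"
    :?p "http://purl.org/dc/elements/1.1/title"}])

;; A keyword is a function of a map, so this pulls out every ?s binding.
(map :?s results) ;; => ("http://members.aol.com/elvezco")

;; Destructure each binding map to format a line per result.
(def lines
  (for [{s :?s p :?p} results]
    (str s " -> " p)))

(first lines)
;; => "http://members.aol.com/elvezco -> http://purl.org/dc/elements/1.1/title"
```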

If we want to work with SPARQL directly instead of building these queries, it’s easy to go from a SPARQL string to a query that can be used directly as well:

user=> (sparql-to-query "SELECT ?s ?p WHERE { ?s ?p \"El Vez\" }")
{:vars [:s :p],
 :filters (),
 :pattern
 ([:?s
   :?p
   #<LiteralImpl El Vez>]),
 :kind :select}

I’m really just scratching the surface of both Clojure and the plaza library here, but hopefully I’ve given you a taste of what’s interesting about using them to easily read, write, and manipulate RDF data.