Mapping relational data to the semantic web

9

I thought it would be interesting to share some of what we’re working on at Revelytix and also ask for comment on an initial piece of it. I’ll start with a little background first. Much of this is also covered in my recent slides from SemTech and that might be a useful set of complementary pictures.

Revelytix is currently working on a project for the US Department of Defense. The aim is to integrate information from across the various service branches within DOD. The Office of the Secretary of Defense (OSD) does not actually manage or control much of data within the DOD – instead that data is managed independently in the various service branches, often in many different databases.

The solution we’re building integrates data from these service branches using semantic web techniques, which allows the OSD to ask questions of this federated data and get back answers in a combined way that is not currently possible. For a high level architectural overview (including this project as only one example), I would recommend this talk by Dennis Wisnosky.

Getting the data from those service branches requires talking to sources that are not actually providing data or answering queries in the languages used in the semantic web (RDF and SPARQL respectively). In particular, much of the data resides in traditional relational databases like Oracle.

The general problem of exposing relational data for use in the semantic web has been explored for years. There is a W3C standards group (the RDB2RDF working group) actively working on standardizing the approach to these problems. Some important dimensions to consider in this process are:

  • Usage – most existing projects allow you to access the data in one of two ways: 1) an ETL process that does a bulk dump of all relational data into an RDF triple store, or 2) dynamic query translation from SPARQL to SQL against the underlying database.
  • Output ontology – many existing projects can produce an automated domain ontology based on the relational database schema. Generally, a domain class is created for each table and a domain property is created for each column. The key columns are used to generate a resource identifier for each row of each table. Additionally, many of the existing projects allow you specify a custom mapping into a pre-existing domain ontology.
  • Mapping definition – when specifying a domain mapping, projects use a wide variety of forms: DDL, rules, ontologies, custom DSLs, etc.

We were looking for a solution that had the following features:

  1. Dynamically translates queries. Moving the data in an ETL process is not an option due to data volume and security.
  2. Leverages existing relational schema info. We want to leverage information already in the database schema without needing to repeat it in the mapping language.
  3. Minimizes repetition within the mapping. Repeating URI patterns or other info in the mapping is a common source of errors during initial modeling and evolution.
  4. Exposes full power of SQL. SQL supports virtually any possible transformation and we want to be able to bring all of that power to bear when specifying a mapping. In particular, we want to have the ability to use aggregates, subqueries, database functions, etc.
  5. Allows analysis of the mapping itself with semantic web tools. We have found a number of use cases for analyzing and validating the mappings themselves with SPARQL or rules. This allows us to do project management tracking, dependency analysis, impact analyis, etc.

Some existing projects that provide database mapping capabilities include D2RQ, Virtuoso, Triplify, Asio, SquirrelRDF, R2O, Dartquery, SPASQL, RDBtoOnto, and many more. Of those, D2RQ seems to be the most commonly used solution and we have built some deliverables using it ourselves. D2RQ satisfies many of the things we were looking for, but fell a bit short in areas like minimizing repetition and in particular providing mappings sufficient for analysis. D2RQ refers to database information in strings of SQL without any other explicit linkage to the database schema.

Due to these deficiencies, we have created our own mapping language. It shares some of the ideas present in D2RQ, however it departs significantly from D2RQ in the details. One major difference is that the Revelytix mapping language depends on importing the database schema and building a source ontology that represents the tables, columns, keys etc from that source. The mapping ontology then references that source ontology.

One major benefit of this strategy is that many of the ontology references to columns and keys are done by pointing to a resource in the database ontology rather than indicating that reference in text. This reduces the chance of errors in the mapping, makes the mapping process easier in an editor, and most importantly, exposes actual linkages from database to domain. We can then write SPARQL queries against the mapping ontology that provide answers about data lineage.

We are very interested in hearing feedback about the choices made in this mapping language. You can find the full mapping language specification (still a work in progress) at this location or view the actual ontology and some examples in our public Knoodl repository. We have an evolving prototype implementation called Spyder that uses these mappings to perform dynamic SPARQL to SQL translation. In the future, we hope to be able to release Spyder as an open source project. Spyder is a key component in a larger semantic federation engine we are building.

Comments

9 Responses to “Mapping relational data to the semantic web”
  1. mark says:

    I’m not sure I understand why RDF was needed to solve these problems. From the description and the presentation and the need to keep SQL expressibility, it looks like a pure federation scenario. Would buying a top tier federation product like Composite solve the problem? Could you elaborate further on the need to add RDF to the mix?

  2. Alex says:

    @mark: That’s a fair question. I ran the team at MetaMatrix building their federation query engine so I am quite familiar with the approach. From a pure data point of view, there aren’t obvious benefits of one vs the others. However, at a higher level, there are a couple of benefits. The richness of ontologies vs other modeling techniques allows for expressing new kinds of information and supports the addition of reasoners over those data sets. Also, using ontologies has so far been less brittle than the same approach using relational models.

  3. Juan Sequeda says:

    Hi Alex,

    I’m an invited expert on the W3C RDB2RDF working group. This is a great post and highlights a lot of the things that we are currently discussing, specially the need of creating an ontology from the database schema.

    Is you mapping language available?

    Have you seen our use-case document?

    http://www.w3.org/2001/sw/rdb2rdf/use-cases/

    We would really appreciate it if you could check it out and give us feedback.

    Thanks!

  4. Alex Miller says:

    @Juan: I linked above to our site that hosts the specification and the mapping language if you’re interested in checking it out.

    I have seen the use case doc and I thought it was pretty good, other than not capturing the requirements I listed above. :)

  5. michael lang says:

    In regard to Mark’s comment at the top of this list, the reason we use RDF and OWL for this solution is that various communities need to model their domains using a standard. RDF and OWL are by far the best standards for describing domains with enough precision and extensibility to drive a runtime execution engine. If Composite or any other commercial products were able to read these RDF descriptions and do the work necessary to accomplish integration and analytic capabilities, they would be used.

    As Alex said in his reply above, we found at Metamatrix that the lack of a standard based modeling approach crippled the federation capabilities of the tool. RDF and OWL are naturally federated technologies and are perfect for the use cases of federating information domains within enterprises, across enterprises and across the web.

  6. Juan Sequeda says:

    @Alex

    Thanks! Got it off the email you sent to the list also :)

    It would be great if you could comment on the use case and list specific requirements that are lacking in the current document during your presentation next week.

    Thanks!

  7. Hi Alex, great post, very informative. I designed and implemented much of D2RQ, and am a member of the RDB2RDF working group. This is useful input.

    I have two questions. First, can you expand a bit on your item #2, “leverages existing relational schema info”? What do you mean by that? Can you give examples of things that you wouldn’t want to put into a mapping file because they are already in the schema?

    Second, on your fifth point. I understand this as saying that you want the mappings to refer to components of the relational schema (tables, columns etc) in a way that makes it possible to, for example, find all the mapping parts that refer to a particular column in the DB. If I’m way off, please correct! If I got this right, then doesn’t this conflict with the fourth requirement? If your mapping contains complex SQL transformations, then the information about the accessed tables and columns is buried somewhere in the SQL strings, and is no longer easily accessible via SPARQL queries for analytics.

    You mention D2RQ’s design decision to refer to columns as literal strings, where you seem to prefer URIs—I imagine a reference to the “users.email” column as “some_uri_prefix/dbname/users/email” or something along those lines. The issue here is character escaping. The set of allowed characters in SQL identifiers and in URIs are quite different, and to make matters worse, each RDBMS comes with its own byzantine set of escaping rules. Expecting mapping authors to correctly write those URIs seemed unrealistic as soon as anything beyond alphanumeric characters are involved, so we went for literals. Do you have any thoughts/experiences on this?

    I haven’t read the Revelytix RDB mapping spec yet, but will do so over the next couple days, forgive me if I’m asking something that should be obvious from that document.

    Again, thanks for the thoughtful post!

  8. Alex Miller says:

    Richard,

    On leveraging existing relational info, I’m talking about things like database column types – if those are in the db, then I shouldn’t have to say that in the mapping file. IIRC, listing which columns are dates is an example of something you might need to specify in a D2RQ mapping. Or perhaps a better example would be a foreign key relationship. If at all possible, I would like the mapping to implicitly understand that a couple in a table that is an FK to a PK should use the same URI pattern used by the referred PK.

    On analyzing mappings, this is an evolving picture. We initially started with the idea of making all relational linkages very explicit and tied to columns, etc in a structured way. As we’ve worked with it, we’ve moved away from having all mapping info be explicit (as that’s often very verbose) and more towards something that leaves more implicit and does leverage chunks of SQL better. We’re still trying to find the happy medium there.

    On columns as URIs, we are really doing both right now – referencing columns directly via a URI when possible and as simple names when using SQL expresssions in some known context. We sort of pre-suppose the existence of a design tool that makes picking a URI easier. WRT the escaping rules, as long as we can define some 1-way name to URI mapping and retain the original name via a label on the resource, I don’t think that will be an issue. At MetaMatrix, we implemented all of those byzantine translation rules and while tedious, it worked out.

    Hopefully we can talk more next week!

  9. B Berger says:

    I am the Metadata Management lead contractor for a project very similar to yours, at USDA. You might consider an OBIEE front-end on top of the existing tables and files. Also, the Informatica Metadata Manager can be very useful on the way to an integrated view of data but there is the problem of integrating it with the front-end.

Speak Your Mind

Tell us what you're thinking...
and oh, if you want a pic to show with your comment, go get a gravatar!