Database

As readers familiar with RDF will know, RDF can be described using triples of entities - [subject, predicate, object]. This is also the usual way of storing RDF in a relational database: one table contains all triples (statements, in RDF terminology), and its columns hold the three entities either directly or as references to a separate entity table. The first approach is theoretically faster, since it does not require two tables to be joined to extract a full statement. However, statements generally contain a lot of redundant data (the object of one statement is often the subject of another, and the objects of different statements often share the same value), so this approach requires far more persistent storage, i.e. the database becomes larger. In the second approach, which uses two tables, a value is stored only once no matter how many statements contain it. This also improves the performance of free-text searches, since fewer strings have to be examined.
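The two layouts can be compared in a small sketch. It uses Python's sqlite3 module with an in-memory database purely for illustration. The table and column names (statements_direct, statements_ref, entities) and the helper functions are assumptions made for this example, not the schema used by Jena or SCAM.

```python
import sqlite3


def build_layouts():
    """Create both hypothetical layouts side by side in one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- First approach: the three entities are stored directly in the statement table.
        CREATE TABLE statements_direct (subject TEXT, predicate TEXT, object TEXT);
        -- Second approach: each value is stored once; statements hold references.
        CREATE TABLE entities (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
        CREATE TABLE statements_ref (subject INTEGER, predicate INTEGER, object INTEGER);
    """)
    return conn


def entity_id(conn, value):
    """Return the id of a value, inserting it only if it is not stored yet."""
    conn.execute("INSERT OR IGNORE INTO entities (value) VALUES (?)", (value,))
    return conn.execute(
        "SELECT id FROM entities WHERE value = ?", (value,)
    ).fetchone()[0]


def add_statement(conn, s, p, o):
    """Store the same statement in both layouts."""
    conn.execute("INSERT INTO statements_direct VALUES (?, ?, ?)", (s, p, o))
    conn.execute(
        "INSERT INTO statements_ref VALUES (?, ?, ?)",
        (entity_id(conn, s), entity_id(conn, p), entity_id(conn, o)),
    )


def read_ref_statements(conn):
    """Reading full statements from the reference layout requires joins."""
    return conn.execute("""
        SELECT s.value, p.value, o.value
        FROM statements_ref t
        JOIN entities s ON s.id = t.subject
        JOIN entities p ON p.id = t.predicate
        JOIN entities o ON o.id = t.object
        ORDER BY t.rowid
    """).fetchall()
```

With two statements that share a predicate and an intermediate resource, the direct layout stores six strings, while the entities table of the reference layout holds only the four distinct values. The price is the join needed to read a statement back.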

Jena uses the second approach, with the extension that every statement "belongs" to a certain model (graph). The models are separated in the API, but not in the actual storage. In other words, you can only query one model at a time using the Jena Model API, since the model of origin is not part of a Jena Statement. This has proven to be a problem for SCAM, since we want to query all models as fast as possible without having to iterate through them. The solution is to incorporate model membership in the statement - the triple becomes a quadruple. The repository implementation can then determine which model each statement comes from. To make this available through the ordinary Jena Model API, a SCAM.inModel property is appended to every Component, describing which model it is located in. This is also the only way to preserve model membership in RDF/XML.
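The quadruple idea can be sketched as a statement table with an extra model column. The sketch below again uses sqlite3 in memory. The table layout, the function names and the example data are hypothetical, and for brevity the values are stored directly rather than as references to an entity table.

```python
import sqlite3


def open_store():
    """Create a hypothetical quadruple table: model plus the usual triple."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE statements "
        "(model TEXT, subject TEXT, predicate TEXT, object TEXT)"
    )
    return conn


def add(conn, model, s, p, o):
    """Store one statement together with the model it belongs to."""
    conn.execute("INSERT INTO statements VALUES (?, ?, ?, ?)", (model, s, p, o))


def find_in_all_models(conn, predicate, obj):
    """Query every model in one pass; each hit reports its model of origin."""
    return conn.execute(
        "SELECT model, subject FROM statements "
        "WHERE predicate = ? AND object = ? "
        "ORDER BY model, subject",
        (predicate, obj),
    ).fetchall()
```

Because the model is a column of the statement itself, a single query covers all models, and no loop over the individual models is needed.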

Currently SCAM works with the following RDBs (see Chapter 2, Configure on how to configure a database):

It is, however, not too complicated to make SCAM compatible with other SQL-92 compliant RDBs. The current implementation of the Jena ModelRDB API is located in scamDrutten, a package that extends the original Jena implementation with RDF helper classes and modified RDB interaction.

This implementation has some limitations. The database tables are not "garbage-collected" correctly, which means that we have a kind of memory leak in the persistent storage. Specifically, the 'entities' table is not cleaned when a statement is removed. In the long run this will affect the performance of the Repository, since the 'entities' table will become exceedingly large. The original Jena database layout suffers from the same problem. We see three possible solutions:

  1. Make sure that referenced entities are removed when the referencing statements are. This is time-consuming since one entity can be referenced by many statements - at run-time we have to traverse the entire statement table to find out how many statements reference the entity.

  2. Implement a separate garbage-collector thread that will run when the system is either halted or the traffic is low.

  3. Change the database layout. We are currently awaiting the final release of Jena2 and performance/scalability testing. Hopefully this release will solve some of the issues troubling Jena1.
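As an illustration of the second option, the sketch below shows what one pass of a separate collector could look like, assuming a reference-based layout. Only the 'entities' table name comes from the text above. The statements table, its columns and both function names are hypothetical, and the scheduling of the pass (on halt or under low traffic) is left out.

```python
import sqlite3


def demo_store():
    """Create a hypothetical reference-based layout for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entities (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
        CREATE TABLE statements (subject INTEGER, predicate INTEGER, object INTEGER);
    """)
    return conn


def collect_garbage(conn):
    """Delete entities that no statement references; return how many were removed."""
    cur = conn.execute("""
        DELETE FROM entities
        WHERE NOT EXISTS (
            SELECT 1 FROM statements
            WHERE statements.subject = entities.id
               OR statements.predicate = entities.id
               OR statements.object = entities.id
        )
    """)
    conn.commit()
    return cur.rowcount
```

Running the cleanup as one batch avoids scanning the statement table on every single removal, which is the cost that makes the first option expensive.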

Another issue is the performance of database interaction. The existing RDB-based RDF repositories are simply not fast enough; the execution time of RDF queries and extraction must improve, especially as the number of statements increases. The main problem is that it is complicated to traverse graphs represented as a flat structure (RDB tables). Consider a Component X consisting of a complex graph that includes several intermediate bNodes. To extract X, we first have to extract all statements having X as subject, then for each bNode extract all statements having it as subject, and so on iteratively until we have traversed all branches of X. This is the price you have to pay for using a relational database for storage. To minimize time consumption you have to optimize the atomic methods for extracting statements. One way is to avoid joining tables, i.e. to keep the entities themselves in the statement table, not just references to them. This decreases time consumption at the cost of increased storage usage, since an entity may be stored multiple times. Jena2 takes this approach, and measurements show that in some cases it is three times faster than its predecessor.
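The iterative extraction described above can be sketched as a breadth-first traversal that issues one query per visited node. The sketch assumes a hypothetical statements table that stores values directly as text, and it assumes that bNodes can be recognized by a "_:" label prefix (the N-Triples convention). Neither assumption is taken from the SCAM or Jena implementation.

```python
import sqlite3
from collections import deque


def extract_component(conn, root):
    """Collect all statements reachable from root by following bNode objects."""
    seen = {root}
    queue = deque([root])
    result = []
    while queue:
        subject = queue.popleft()
        # One round trip to the database per visited node.
        rows = conn.execute(
            "SELECT subject, predicate, object FROM statements "
            "WHERE subject = ? ORDER BY rowid",
            (subject,),
        ).fetchall()
        for s, p, o in rows:
            result.append((s, p, o))
            # Descend only into bNodes, and only once per bNode.
            if o.startswith("_:") and o not in seen:
                seen.add(o)
                queue.append(o)
    return result
```

The number of queries grows with the number of bNodes in the Component, which is why the cost of the single-subject lookup dominates the total extraction time.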