A Semantic Web Primer for Object-Oriented Software Developers

No W3C Status, $Date: 2005/09/19$

This version:
http://www.knublauch.com/oop/2005/09/19/
Latest version:
N/A
Previous version:
http://www.knublauch.com/oop/2005/09/12/
Editors:
Holger Knublauch, Stanford University and the University of Manchester, <holger@smi.stanford.edu>
Daniel Oberle, Universität Karlsruhe, <oberle@fzi.de>
Phil Tetlow, IBM, <philip.tetlow@uk.ibm.com>
Evan Wallace, National Institute of Standards and Technology, <ewallace@cme.nist.gov>

  1. Introduction
  2. Application Development with Semantic Web Technology
  3. Introduction to RDF and OWL
  4. Programming with RDF and OWL
  5. Appendix: Where to go from here

1 Introduction

Almost all software systems are centered around a domain model. Domain models describe the relevant concepts and data structures from an application domain and encode knowledge that is useful to drive an application's behavior. For example, assume that our task is to develop an online shopping system. In the requirements analysis for this system, we would learn that

After a bit of thinking we may come up with an object-oriented design such as the following UML diagram.

Figure 1: A simple domain model in UML syntax.

We can present this UML diagram to our customer and after a few iterations, we may end up with a data structure that can be implemented in our favourite programming language. We may also begin with user interface components for end-users (maybe Java Server Pages) and for the online shop managers (who may require a more sophisticated front-end implemented in Java/Swing, C# or Visual Basic). If our system is very successful, more components will be built around it, for example to access the product catalogue through a Web Service. In these cases, other components would want to share the same data structures and domain knowledge so that they can interoperate. If our system is not so successful or the system is ported to a different platform, we may at least want to reuse parts of it. In these cases, it would be useful to have access to the underlying domain model so that we can extract those parts that we need.

Since we anticipate all these future developments, we design our system with a Model-View-Controller architecture. This well-known pattern suggests separating domain models from user interface and control logic. The separation of non-visual parts from visual components makes it potentially easier to reuse and share domain models for other applications and target platforms. Unfortunately, the promise of reusability of object-oriented models is often not fulfilled. In many cases, domain models like the above contain hard-coded dependencies on the specific application. Especially once the model is encoded in a programming language like Java, much of the knowledge that went into the initial design is lost: the condition that determines whether a purchase order is duty free may be encoded by if-statements deep in some imperative method (such as totalPrice() in the UML diagram), and the fact that each PurchaseOrder requires at least one product will also not be clear unless you care to read through the control logic of the user interface implementation. Another typical problem with such systems is interoperability. For example, if some other application wants to access data or services from our system, it would need to go through a well-defined interface (API) that is tightly coupled with our application. Maybe an intermediate format based on XML is used to exchange information between such applications. If multiple applications with similar tasks are to interoperate, a large number of such interfaces and exchange formats will be needed.

In our example scenario, the UML diagram would have the highest potential for reuse and interoperability. The UML model is on a higher level of abstraction and could be used to derive implementation code for various purposes. However, even if two components or applications have started with the same UML diagram, they may have incompatible implementations, and a lot of hand-coding will still be necessary to implement them. In which format shall customer data be stored and shared? The UML model may be ambiguous and misunderstood. In one implementation, countries may be stored as string values, while others may want to represent them as instances of a Country class. In either case, it is unclear where and how the specific countries such as Germany and France shall be represented in UML. Furthermore, UML diagrams are typically only maintained as intermediate artifacts in the development life cycle, used as the foundation for the implementation but then put into drawers, where they are inaccessible to other developers. UML models are often hidden for a good reason, because they may no longer be up to date with the real implementation. The result of this software development reality is that much time is wasted on unnecessary duplicate work. Domain models need to be crafted from scratch, and then mapped into intermediate formats to share data between applications.

In an ideal world, developers would discover shareable domain models and knowledge bases from a repository and then wire them together with the remaining object-oriented components for user interface and control logic. All applications that share overlapping domain models would have a certain degree of interoperability built in. While this ideal world is still a vision, some promising approaches are starting to emerge.

Largely unnoticed by the main software engineering camps, the World Wide Web Consortium (W3C) has designed some very interesting technology in the context of its Semantic Web vision. This technology, including RDF and OWL, was originally designed with the goal of making internet pages easier to understand for intelligent agents and web services, but it turns out that Semantic Web languages and tools could also play a major role in software development in general.

In a nutshell, Semantic Web based development suggests designing domain models in Web-based object-oriented languages such as OWL and RDF. OWL has been optimized to represent structural knowledge on a high level of abstraction. Domain models encoded in OWL can be published on the Web and shared between multiple applications. The OWL models themselves encode much of their meaning (also known as semantics), so that applications can discover and access appropriate models dynamically. The richness of the Semantic Web representation languages makes it easier to build reusable, high-quality domain models, because additional reasoning services such as consistency checking and classification can be exploited. At the same time, OWL and RDF operate on structures similar to those of object-oriented languages, and can therefore be integrated relatively seamlessly with traditional software components.

The purpose of this document is to explain how object-oriented applications can be designed and implemented with the help of Semantic Web technology. Section 2 gives an outline of how the application development life cycle can benefit from Semantic Web approaches. Section 3 introduces the Semantic Web languages RDF and OWL, and compares them to object-oriented modeling languages. Section 4 shows how RDF and OWL models can be embedded into object-oriented programs (here, using Java). Section 5 provides references to further reading, tools and libraries.

2 Application Development with Semantic Web Technology

What is the Semantic Web? Most of the current internet content is geared toward human users. Presentation languages such as HTML contain instructions for Web browsers on how to present multimedia content to humans. However, if we wanted to employ a computer program to search for Web-based information for us, then this program would find it very difficult to make any sense of these Web pages, unless it had advanced human language skills. Furthermore, contemporary Web languages like JSP or ASP allow an arbitrary mixture of model and view parts in a single file, leading to very unstructured content.

The vision behind the Semantic Web is to make internet content machine-readable so that it can be analyzed more easily by software agents and shared between Web Services. For that purpose, the World Wide Web Consortium (W3C) is recommending a number of Web-based languages that can be used to formalize internet content. RDF and OWL can be used to describe classes, attributes and relationships similar to object-oriented languages. For example, RDF can be used to define that the class Product has a property hasPrice which takes values of type float. You can also define a class Purchase with a property hasProducts which relates it to multiple Products. OWL extends RDF with additional constructs to define more complex relations. For example, OWL can be used to define a class DutyFreeOrder as the subclass of all purchases that have a delivery address in a country that is known to have a free-trade agreement. The W3C also works on other languages for describing if-then rules and complex SQL-like queries, but our focus here lies on RDF and OWL.

Domain models in any of these languages can be linked into the Semantic Web just like you would publish an HTML page. Once an RDF or OWL file is online, other Web resources or applications can link to it. For example, an HTML page showing a certain product could encode metadata to link back to the corresponding entity in an RDF model. Or, providers of certain products can instantiate the RDF classes to announce their portfolio to shopping agents. A typical scenario for such a Semantic Web application is shown in Figure 2.

Figure 2: An application using Semantic Web technology can exploit domain models and services from the Web.

While some of this could also be achieved using traditional XML-based approaches, Semantic Web languages are far more flexible and extensible. Since their basic structure is in a sense object-oriented, it is possible to define subclasses and generalizations of concepts. Since every Semantic Web resource has a unique URI, it is possible to establish links between existing models. This means that whenever a model of a certain domain has been published on the Web, then others are able to build upon it, and thus to establish a network of domain knowledge.

The extensibility of Semantic Web languages supports reusability on a global scale. Instead of defining the 1000th variation of a product-purchase domain model, application developers could locate a suitable model from the Web and simply reuse or extend it. By reusing an existing model, different applications with similar tasks can share results and data much more easily. Furthermore, it is far more likely that an application-independent reusable component (such as a shopping basket application or a credit card handling Web Service) can be integrated.

This reusability is partly based on the fact that Semantic Web languages are Web-based: each class, property or object in an RDF or OWL file has a unique identifier (URI), so that it can be referenced from anywhere else. The other major strength that makes Semantic Web models easier to reuse is that OWL is founded on formal logic. This means that OWL models are not limited to defining classes and their attributes, but can also encode the intended "meaning" of these classes, so that the classes can be unambiguously shared between groups of humans or machines. Domain models that are based on such well-defined logics are often called ontologies. In fact, the abbreviation OWL stands for the "Web Ontology Language". From an object-oriented point of view, ontologies are domain models whose classes contain logical statements that make their meaning explicit. We will show later that tools called reasoners can exploit these logical statements to perform advanced queries which reveal implicit relationships between resources.

Ontologies and domain models often span different levels of application-dependency and reusability. Revisiting the example from the introduction, statements 1. and 2. specify data structures to represent customer and purchase data. Statements 3., 4., and 5. are about specific countries, which could be used for geographical or political applications. Statement 6. is independent from these specific countries, and describes general domain relationships for countries that fulfill certain criteria. These parts should be made reusable or reused from standardized solutions. In fact, ontologies are often defined by groups of humans (such as an e-commerce consortium or a national geological survey) in order to build a shared domain vocabulary for information integration. Once a standardized ontology of countries and their relationships exists, there is no need to reinvent the wheel for each individual application. Furthermore, reusing an existing ontology from the Web has the advantage that the application would more directly benefit from updates such as new countries.

However, the specific customers and our online shop's localization to a specific country are application-specific and need to be custom-tailored. Such custom-tailoring can be achieved by adding specific subclasses or instances. If shared ontologies / domain models are not optimized for a specific application purpose and therefore need to be adapted or built from scratch, then domain modeling tools (such as Protege, as shown in Figure 3) can be used. These tools are suitable for domain experts who have little or no training in programming languages. Essentially, these tools provide visual editors for classes and relationships, and allow users to create instances of these classes. [@@ Note: I definitely don't insist on putting a Protege screenshot here (I am involved in Protege development myself) - I don't want to exploit this note for cheap propaganda. However, I think it would be invaluable for readers of this note to see that real tools exist and how they compare to UML-based modeling tools. Any other suggestions for tools, screenshots are welcome].

Figure 3: Domain modeling tools such as Protege can be used to define classes, properties and individuals.

The domain modeling activities in such a development process can be compared to requirements analysis and design steps in traditional software development. The domain experts join forces with software designers to come up with suitable abstractions of a domain. Ontologies from the Web are combined, extended and instantiated. Ontology development tools provide facilities to instantiate classes, so that example instances can be created and prototyped. The resulting domain models are then combined by programmers with the remaining application components such as user interface and control logic. In contrast to traditional object-oriented design methodologies, where analysis and design only lead to intermediate artifacts for code generation, the Semantic Web approach uses the same models for all stages from analysis, design and implementation to testing and even at run-time. The ontologies defined in the early phases determine the classes in the implementation, but at the same time the original design models remain accessible when the application is executing. The formal logic behind ontologies can then even be exploited to drive test cases. If domain models have explicit run-time semantics, it is possible to use reasoning services. We will look into this in more detail after we have introduced the basics of RDF and OWL.

3 Introduction to RDF and OWL

In order to implement the Semantic Web vision, the W3C has produced a number of language specifications. RDF is the base infrastructure to represent classes, properties and instances in a Web compliant format. OWL extends RDF with richer expressivity. Both languages are now supported by tools, parsers and programming APIs. This section will introduce RDF and OWL and compare them to object-oriented languages.

3.1 RDF and RDF Schema

RDF (Resource Description Framework) is a Web-based language that can be used to describe relationships between resources. A resource can be anything with a Uniform Resource Identifier (URI). Since they have URLs, HTML pages, images, and multi-media files are resources. In RDF, resources can also be classes, properties or instances. For example, the URI http://ecommerce.org/ecommerce.rdf#Product could represent a class in an RDF file, and you could use this URI to annotate a Web page of a certain product.

RDF just defines the very basic syntax for Semantic Web contents, and has an XML serialization that allows users to share models on the Web. RDF Schema defines an object-oriented model for RDF. RDF Schema defines how classes, subclass relationships, properties, datatypes etc. are represented. For example, the following RDF Schema file declares a class Product and a property hasPrice:

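<!-- The base URI plus an rdf:ID yields the full URI of each declared resource -->
<rdf:RDF xml:base="http://ecommerce.org/ecommerce.rdf"
         xmlns="http://ecommerce.org/ecommerce.rdf#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <!-- Declares the class http://ecommerce.org/ecommerce.rdf#Product -->
  <rdfs:Class rdf:ID="Product"/>

  <!-- Declares the property hasPrice: Products have float-valued prices -->
  <rdf:Property rdf:ID="hasPrice">
    <rdfs:domain rdf:resource="#Product"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#float"/>
  </rdf:Property>
</rdf:RDF>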

It is far beyond the scope of this document to introduce the details of RDF, RDF Schema or the XML syntax. For the purpose of this document only a few concepts are important. URIs are often split into a namespace and a local name, and the namespace can be abbreviated with a prefix notation. For example, rdfs:Class is the abbreviation of the URI http://www.w3.org/2000/01/rdf-schema#Class if the prefix rdfs has been declared in the head of the file. If no prefix is given (such as in "Product"), then the default namespace of the file is used. In order to simplify the presentation in this document, we will focus on this short notation based on the local names.
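The prefix notation can be mimicked in a few lines of code. The following is a minimal Python sketch and not part of any RDF library: the PREFIXES table and the expand function are hypothetical names chosen for illustration, and the empty-string key stands for the default namespace of the example file, following the simplified short notation used in this document.

```python
# Prefix table as it might be declared in the head of an RDF file.
# The "" entry stands for the default namespace (an assumption of this sketch).
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "": "http://ecommerce.org/ecommerce.rdf#",
}


def expand(short_name, prefixes=PREFIXES):
    """Expand 'prefix:local' (or a bare local name) into a full URI."""
    prefix, sep, local = short_name.partition(":")
    if not sep:
        # No colon: the whole string is a local name in the default namespace.
        prefix, local = "", short_name
    return prefixes[prefix] + local


print(expand("rdfs:Class"))
print(expand("Product"))
```

A prefix that has not been declared raises a KeyError here, which roughly mirrors the fact that a parser would reject an undeclared prefix.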

Namespaces can be compared to packages in object-oriented languages. The file above can therefore be regarded as defining the package http://ecommerce.org/ecommerce.rdf#. All resources declared in a namespace are public, so that all RDF files can directly refer to each other. For example, you could create another RDF file that defines an instance of the Product class above, and fill the object with a specific price. Such instances are called individuals in RDF. In contrast to object-oriented languages, individuals can have more than one type. For example, the individual http://myshop.com/products.rdf#Harry could be declared to be both an instance of http://ecommerce.org/ecommerce.rdf#Product and of http://auctioning.org/model.rdf#AuctionItem. This would make it possible to use the same resource (denoted by its URI) in one context as a product and in another context as an auction item.
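To make the idea of multiple types concrete, the statements about the individual Harry can be written down as plain (subject, predicate, object) triples. This is only a sketch in Python using a set of tuples; the helper types_of is a hypothetical name and does not belong to any RDF API.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HARRY = "http://myshop.com/products.rdf#Harry"

# Each statement is a (subject, predicate, object) triple of URIs.
triples = {
    (HARRY, RDF_TYPE, "http://ecommerce.org/ecommerce.rdf#Product"),
    (HARRY, RDF_TYPE, "http://auctioning.org/model.rdf#AuctionItem"),
}


def types_of(resource, statements):
    """Return the set of classes that a resource is declared to belong to."""
    return {o for (s, p, o) in statements if s == resource and p == RDF_TYPE}


harry_types = types_of(HARRY, triples)
print(sorted(harry_types))
```

Because the types are just statements in a set, nothing prevents a third file from adding yet another type for the same URI.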

RDF Schema classes are sets of individuals with shared characteristics. Classes can be arranged in a subclass hierarchy very similar to object-oriented systems. Like UML, RDF Schema supports multiple inheritance. A major difference between RDF and object-oriented languages is that all classes can overlap. Since individuals can have multiple types, this means that some instances may be shared among classes. Furthermore, instances can change their type during their life cycle. A purchase order may start as a plain instance of the PurchaseOrder class, and later change its type to DutyFreeOrder when the program has collected more information about the customer's delivery address.

Another important difference between Semantic Web and object-oriented languages is that the Semantic Web is an open world, in which files can add new information about existing resources. Since the Web is a huge place in which everything can link to anything else, it is impossible to rule out that a statement could be true, or could become true in the future. For example, if we define a class, we usually cannot know all its instances in advance. Likewise, we cannot rule out that a certain Product will also be used as an AuctionItem. This "open-world assumption" means that modeling domains for the Semantic Web may require a bit of training for people who are used to the closed, finite world of classical object-oriented systems or traditional databases. On the other hand, it offers a bounty of flexibility and an open world of opportunities for reuse and interoperability.
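The difference between the two assumptions can be sketched as two ways of answering the same membership question. The Python below is an illustration under simplifying assumptions: both function names are hypothetical, and the open-world variant ignores that a real OWL reasoner can sometimes prove a statement false (for example, when classes are declared disjoint).

```python
def closed_world_is_a(individual, cls, known_types):
    """Closed world: anything that is not stated is taken to be false."""
    return cls in known_types.get(individual, set())


def open_world_is_a(individual, cls, known_types):
    """Open world: a missing statement means 'unknown', not 'false'."""
    if cls in known_types.get(individual, set()):
        return True
    return None  # unknown: another file on the Web may still assert it


known = {"Harry": {"Product"}}

print(closed_world_is_a("Harry", "AuctionItem", known))
print(open_world_is_a("Harry", "AuctionItem", known))
```

Returning None rather than False forces the calling code to handle the "we do not know yet" case explicitly, which is the habit the open-world assumption asks of modelers.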

But let us turn back to the RDF language for now. RDF properties can be compared to attributes, fields or relationships in object-oriented languages. However, while in UML or Java attributes are attached to a single class only, RDF properties are stand-alone entities which can be defined independently from classes and used in multiple classes. For example, you can define a property hasPrice and then attach it to all classes where a price makes sense. This also makes it possible to reuse the same property across multiple files. For example, if you create a model for online auctioning software, you could use the price property from the online shopping model to also represent prices for the auction items. Sharing the same property across multiple models means that values can be more easily integrated, for example to compare the current auctioning price with the price for a new product in other online shops.

In order to "attach" a property to a class, rdfs:domain statements can be used. rdfs:domain is a system property from the RDF Schema namespace that relates a property to a class. In the example file above, the domain of hasPrice is Product. From an object-oriented point of view this would mean that all instances of the Product class can have values for this property. However, in RDF and OWL this also has an additional meaning: any resource that has a value for the hasPrice property must also be a Product. In other words, a domain statement in RDF can be used to classify instances: if something has a price, then we can treat it as an instance of Product, even if it has been officially declared to be an AuctionItem only. We will revisit this crucial difference later, in the context of reasoning with OWL.
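The classifying effect of a domain statement can be sketched as a single inference rule over a set of triples. The Python below is a simplified illustration and not a complete RDF Schema reasoner: short names are used in place of full URIs, and infer_types_from_domain is a hypothetical helper.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_types_from_domain(triples):
    """Add (s, rdf:type, C) for every (s, p, o) where p has rdfs:domain C."""
    # Collect the declared domain classes of each property.
    domains = {}
    for s, p, o in triples:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)
    # Every subject that uses such a property becomes a member of its domain.
    inferred = set(triples)
    for s, p, o in triples:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))
    return inferred


asserted = {
    ("hasPrice", RDFS_DOMAIN, "Product"),
    ("Harry", RDF_TYPE, "AuctionItem"),  # officially only an auction item
    ("Harry", "hasPrice", 12.5),         # but it has a price
}

result = infer_types_from_domain(asserted)
print(("Harry", RDF_TYPE, "Product") in result)
```

Note that nothing is rejected: unlike a type check in an object-oriented language, the rule only ever adds statements.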

By the way, primitive values such as prices are called literals in RDF, and literals have an XML Schema datatype such as xsd:string or xsd:float as their type. It is possible to limit the type of values of a property using an rdfs:range statement. A property can either have an XML Schema datatype as its range, or a class. Properties that have classes in their range can be compared to relationships in object-oriented languages. For example, if the property hasCustomer has the range Customer, then all values of the property must be customers. Similar to domain statements, range statements can also be interpreted the other way around: if we know that a resource is the value of a hasCustomer property, then we can infer that the resource is in fact a Customer, even if it has other types as well.
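Range statements can be read as an inference rule about the values of a property. The following Python sketch applies that rule to properties whose range is a class and leaves literal values alone; it assumes, purely for illustration, that datatype ranges are written with an xsd: prefix, and infer_types_from_range is a hypothetical helper.

```python
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def infer_types_from_range(triples):
    """Add (o, rdf:type, C) for every (s, p, o) where p has the class C as range."""
    ranges = {}
    for s, p, o in triples:
        # Assumption of this sketch: datatype ranges carry an "xsd:" prefix.
        if p == RDFS_RANGE and not o.startswith("xsd:"):
            ranges.setdefault(s, set()).add(o)
    inferred = set(triples)
    for s, p, o in triples:
        for cls in ranges.get(p, ()):
            inferred.add((o, RDF_TYPE, cls))
    return inferred


asserted = {
    ("hasCustomer", RDFS_RANGE, "Customer"),
    ("hasPrice", RDFS_RANGE, "xsd:float"),
    ("order1", "hasCustomer", "alice"),
    ("Harry", "hasPrice", 12.5),  # literal value, no class membership inferred
}

result = infer_types_from_range(asserted)
print(("alice", RDF_TYPE, "Customer") in result)
```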

3.2 OWL

As shown in the previous paragraphs, RDF defines a simple domain modeling schema similar to object-oriented languages. You can define classes and their properties and then create instances of these classes. However, there is little beyond that, and RDF alone would be a rather poor domain modeling language. For example, RDF cannot express cardinality constraints, such as the constraint that each Product can have only one price.

The Web Ontology Language (OWL) extends RDF Schema and uses the same RDF syntax as its base platform. OWL adds language elements to express complex logical relationships between classes and properties. The central building blocks of these relationships are so-called restrictions, which are used to describe the characteristics of the property values at a certain class. OWL supports various types of restrictions:

It is important to understand that OWL restrictions are themselves classes (so-called anonymous classes). Like named classes, anonymous classes also describe sets of individuals. The instances of a restriction are those individuals that fulfill its constraints. Regarding classes as sets is the foundation for more complex OWL definitions. OWL provides class constructors to define intersections, unions and complements of other classes. For example, you can define the class GermanBookLovingCustomer to consist of those individuals that are in the intersection of GermanCustomer and BookLovingCustomer. If you use restrictions instead of named classes, you can build expressions of arbitrary complexity in this way. For example, you could define the class of all customers from France who have issued at least 3 purchase orders or at least one order consisting only of books, except those customers who have ordered a DVD.
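The set-based reading of classes can be tried out with ordinary set operations. The Python sketch below uses made-up customer names and order counts, and it treats the sets as complete lists of members, which is a closed-world simplification that OWL itself does not make.

```python
# Hypothetical class extensions (sets of individuals), for illustration only.
german_customers = {"anna", "bernd", "clara"}
book_loving_customers = {"bernd", "clara", "dave"}

# Hypothetical property values: the purchase orders issued by each customer.
orders = {
    "anna": ["o1", "o2", "o3"],
    "bernd": ["o4"],
    "clara": ["o5", "o6", "o7"],
    "dave": [],
}

# Intersection of two named classes: GermanBookLovingCustomer.
german_book_loving = german_customers & book_loving_customers

# A restriction is itself a set: all individuals with at least 3 orders.
at_least_3_orders = {c for c, issued in orders.items() if len(issued) >= 3}

# Restrictions and named classes combine freely into larger expressions.
loyal_german_book_lovers = german_book_loving & at_least_3_orders

print(sorted(german_book_loving))
print(sorted(at_least_3_orders))
print(sorted(loyal_german_book_lovers))
```

Unions and complements work the same way with the | and - operators, although a complement in OWL is taken relative to all individuals rather than to one known list.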

@@ Here we may have a Venn-Diagram based example to illustrate these concepts

Again, it goes far beyond the scope of this document to explain these logical concepts in detail, and we point to some introductory material in the appendix. The key point here is that OWL can be used to define classes by means of logical statements about their members. In object-oriented systems, such statements would typically have been hidden somewhere inside the code base. In Semantic Web ontologies, the logical relationships are made explicit through OWL class definitions and other formal statements. This not only makes it easier for other human users of your model to understand the intended meaning, but also means that other tools can use the definitions transparently. OWL models simply declare things, and it is entirely up to the applications to do something useful with these declarations.

Some of these Semantic Web applications can exploit other tools to handle and analyze OWL models. One family of such tools is called reasoners. A reasoner is a service that takes the statements encoded (asserted) in an ontology as input and derives (infers) new statements from them. In particular, OWL reasoners can be used

Here is an example: Assume you have defined a class DutyFreeOrder, which consists of all PurchaseOrders that have been issued by customers who belong to the set of all customers that live in a free-trade country. Now assume a new user logs into the online shop and starts putting items into his shopping basket. Internally we will create blank instances of the Customer and PurchaseOrder classes. Later, when the user proceeds to the checkout and enters his delivery address, we can ask a reasoner to classify the PurchaseOrder. This will give us the most specific class that the particular order belongs to (here, a DutyFreeOrder).
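The classification step in this scenario can be imitated with a hand-written function. The Python sketch below is not an OWL reasoner: it hard-codes the definition of DutyFreeOrder that a reasoner would read from the ontology, and the set of free-trade countries, the dictionary layout and the name classify_order are all assumptions made for illustration.

```python
# Hypothetical list of free-trade countries; a real ontology would define this.
FREE_TRADE_COUNTRIES = {"Germany", "France"}


def classify_order(order, customers):
    """Return the most specific class name that the order is known to belong to."""
    customer = customers.get(order.get("customer"), {})
    country = customer.get("country")  # None until the address is entered
    if country in FREE_TRADE_COUNTRIES:
        return "DutyFreeOrder"
    return "PurchaseOrder"


# Blank instances created when the user starts shopping.
customers = {"c1": {}}
order = {"customer": "c1", "products": ["book42"]}
before_checkout = classify_order(order, customers)

# At checkout the user enters a delivery address.
customers["c1"]["country"] = "France"
after_checkout = classify_order(order, customers)

print(before_checkout, "->", after_checkout)
```

The order object itself is never modified; its class changes only because more is now known about the customer, which is the behaviour the paragraph above describes.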

[@@ Add another example for class-based reasoning and consistency checking]

In contrast to object-oriented systems, where objects normally cannot change their type, applications based on Semantic Web technology can follow a rather dynamic typing system. RDF and OWL classes themselves are also dynamic: it is possible to create and manipulate classes at run-time. For example, one could define a class [@@ example?] and then ask the reasoner about the instances of this class. This means that reasoners can be compared to rich query answering systems. These queries can be asked at ontology design time, but also at execution time.

3.3 Comparison of OWL/RDF and Object-Oriented Languages

Summarizing the introduction of RDF and OWL, the following table shows important differences between Semantic Web languages and object-oriented languages:

Common to both: Domain models consist of classes, properties and instances (individuals). Classes can be arranged in a subclass hierarchy with inheritance. Properties can take objects or primitive values (literals) as values.

| Object-Oriented Languages | OWL and RDF |
| --- | --- |
| Classes are regarded as types for individuals. | Classes are regarded as sets of individuals. |
| Each individual has one class as its type. | Each individual can belong to multiple classes. |
| Classes cannot share instances. | Individuals can belong to multiple classes. |
| Individuals cannot change their type at run-time. | Class membership may change at run-time. |
| The list of classes is known at compile-time. | Classes can be created and changed at run-time. |
| Compilers are used at build-time. Compile-time errors indicate problems. | Reasoners can be used for classification and consistency checking at run-time or build-time. |
| Properties are attached to one class (and its subclasses through inheritance). | Properties are stand-alone entities that can exist without specific classes. |
| Instances can only take values for the attached properties. Values must be of the correct types defined for the properties. | Any instance can take arbitrary values for any property, but this may affect what reasoners can infer about their types. |
| Classes encode much of their meaning and behavior through imperative functions and methods. | Classes make their meaning explicit in terms of OWL statements. No procedural attachment is possible. |
| Classes can encapsulate their members to private access. | All parts of an OWL/RDF file are public and can be linked to from anywhere else. |
| Closed world: If something is not part of the model, then it is assumed to be false. | Open world: If something is not part of the model, then it may be true or false. |
| ... | ... |
| Some generic APIs are shared between applications. Few (if any) UML diagrams are shared. | RDF and OWL have been designed from the ground up for the Web. Domain models can be shared online. |
| Domain models are designed as part of a software architecture. | Domain models are designed to represent knowledge about a domain, and for information integration. |
| UML, Java, C# etc. are mature technologies supported by many commercial and open-source tools. | The Semantic Web is an emerging technology with some open-source tools and a handful of commercial vendors. |
| UML models can be serialized in XMI, which is geared for exchange between tools but not really Web-based. Java objects can be serialized into various XML-based intermediate formats. | RDF and OWL objects have a standard serialization based on XML, with unique URIs for each resource inside the file. |

4 Programming with RDF and OWL

Appendix: Where to go from here