Writing Iterative Algorithms in SPARQL: Pt. 2

SteveIn the first post of this series we looked at what iterative algorithms are. In this post we’ll look at the actual SPARQL queries that implement the peer-pressure clustering algorithm.

The algorithm starts with a set of vertices and edges between pairs of them. The code, simply stated, is:

assign each vertex to an initial cluster
do {
	assign each vertex to the most popular cluster of its neighbors
} while (enough vertices changed clusters in this iteration)

Each statement results in a query. We chose to put each vertex in an initial cluster named the same as the vertex. The SPARQL query for this looks like:

[ 1] DROP GRAPH <urn:ga/g/xjz0>
[ 2] CREATE GRAPH <urn:ga/g/xjz0>
[ 3] INSERT {
[ 4]       GRAPH <urn:ga/g/xjz0> {?s <urn:ga/p/inCluster> ?s }
[ 5] }
[ 6] WHERE {
[ 7]    SELECT DISTINCT ?s WHERE {
[ 8]      ?s <urn:ga/p/hasLink> ?o .
[ 9]    }
[10]}

The cluster assignments are extra data that we’re adding to the database.  To avoid cluttering up the default graph and make it easy to find them, we put them in named graphs chosen for this execution of the algorithm (“xjz” in the examples), so their names do not collide with other executions of the algorithm.  In lines 1 and 2, we delete via the DROP construct any graph of that name and CREATE a new empty graph.  The SELECT clause on lines 7-9 finds all vertices in the default graph that are the subject of a hasLink predicate and, for each unique such vertex, then on lines 4-6 INSERTs into the named graph a new triple of the same subject, the inCluster predicate, and the subject name as the cluster name.

We have used URIs starting with the urn: prefix so we can freely make up a namespace with no concern of collisions with other usage of the same strings elsewhere.  These URIs will not be dereferenceable, which in this situation is not a problem.  In <urn:ga/p/hasLink> and <urn:ga/g/xjzi>  the “ga” stands  for graph analytics, the “p” for predicate, and the “g” for graph name.

Typically the second query and the third query will execute repeatedly.  The second query calculates new cluster assignments based on current cluster assignments.

[11] DROP GRAPH <urn:ga/g/xjzi+1>
[12] CREATE GRAPH <urn:ga/g/xjzi+1>
[13] INSERT
[14] {
[15]   GRAPH <urn:ga/g/xjzi+1>  { ?s <urn:ga/p/inCluster> ?clus3 }
[16] }
[17] WHERE {
[18]   { SELECT ?s (SAMPLE(?clus) AS ?clus3)
[19]   {
[20]     { SELECT ?s (MAX(?clusCt) AS ?maxClusCt)
[21]       {
[22]         SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)
[23]         WHERE
[24]         {
[25]           ?s <urn:ga/p/hasLink> ?o .
[26]           GRAPH <urn:ga/g/xjzi> { ?o <urn:ga/p/inCluster> ?clus }
[27]         } GROUP BY ?s ?clus
[28]       } GROUP BY ?s
[29]     }
[30]     { SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)
[31]       WHERE
[32]       {
[33]         ?s <urn:ga/p/hasLink> ?o .
[34]         GRAPH <urn:ga/g/xjzi> { ?o <urn:ga/p/inCluster> ?clus }
[35]       } GROUP BY ?s ?clus
[36]     } FILTER (?clusCt = ?maxClusCt)
[37]     } GROUP BY ?s
[38]   }
[39] }

Lines 22-27, for each vertex, returns the vertex, the cluster assignments of its neighbors, and the count of vertices in that cluster.  Lines 20-29 calculate the maximum cardinality of the clusters of the neighbors of each vertex.  Lines 30-36 calculate the cluster assignment of that maximum-cardinality cluster.  (SPARQL lacks a construct that returns the maximum value of one intermediate result and the corresponding element of another intermediate result.)  Lines 18-38 join the maximum cardinality with the cluster name and also, in the case of a tie in maximum cardinality, break any tie by SAMPLEing a cluster assignment for each cluster.  Lines 14-16 INSERT the new cluster-assignment triples into the named graph.

Because this query reads and writes to graphs whose names change with the iteration count, it is simple to create the queries in a scripting language (JavaScript in this case) to make these modifications to the query for each iteration.  We have chosen a naming scheme of <graphName>i, where i varies across a range, to simplify debugging by examination of intermediate cluster assignments.  An alternative (that is more memory conserving) would use just two named graphs (e.g., current and new) and MOVE new to current at the end of each iteration.  The algorithm does clean up all but the named graph with the final cluster assignments, so the extra memory use is only transient.

[40] SELECT (COUNT(?oNew) as ?vccCt)
[41] WHERE {
[42]   GRAPH <urn:ga/g/xjzi>   {?s <urn:ga/p/inCluster> ?oOld}
[43]   GRAPH <urn:ga/g/xjzi+1> {?s <urn:ga/p/inCluster> ?oNew}
[44]   FILTER (?oOld != ?oNew)
[45] }

The second query executed in each iteration (lines 40-45) counts the number of vertices that changed cluster assignment in the just-completed iteration.  The JavaScript loop that makes the queries then decides whether the algorithm has converged, either by the absolute number or percentage of vertices that changed or by a maximum iteration count.

I have omitted the text of a trivial query done at initialization time to count the number of vertices to be clustered.

Careful readers may note that the inner SELECTs at lines 22-27 and 30-35 are identical.  While developing a complex nested query like this, needing to keep the same code in two spots identical is cumbersome.  SPARQL 1.1 contains no good mechanism to define this code once and reuse it, like a function in a procedural language.  The SQL WITH clause defines by name such a code block that can be executed wherever its results are needed.

The second and third queries above both have minor changes from one instance to the next (e.g., substituting “xjz2”, “xjz3”, … into the graph name).  While these are not hard to cope within JavaScript code that creates the queries, it does mean that the query is literally different each time that it is executed. Hence, the SPARQL endpoint will have to reinterpret the query each time, which could at some point become time-consuming.  SQL’s placeholder capability enables the passing of a value (of a given type) at execution time that is inserted at the placeholder’s position in the query, avoiding reinterpretation.

In this post we have looked at the SPARQL queries that implement peer-pressure clustering as an iterative algorithm.  Next time we’ll look at the JavaScript code that creates the queries and calls the SPARQL endpoint.

Harnessing the Power of Data Discovery: Pt. 2

AdnanIn my last blog entry, I talked about why traditional relational database systems aren’t as flexible for data discovery and had introduced the concept of a graph as a better alternative for analytics. Before diving right into why graph databases are better suited for big data discovery, I want to take a moment to show why tabular (or even columnar) systems don’t lend themselves to relationship analytics.

Now let’s look at a simple transactional database. For the purpose of this simplified example, let’s assume an online company that sells apparel is looking for better targeted ads to improve sales. The database stores the details of people, their addresses, what they purchased, as well as other pertinent information about the users and their transactions. The schema is pretty straightforward in this illustration in fig 1a.

Fig 1a

One could do several types of analytics on this data, like cluster analysis of a geographic region or frequency of purchase. With enough transactions from an individual, you could also do some basic consumer profiling. If you know what you’re looking for, the compact tabular representation is efficient storage wise. However, this efficiency comes at a price – inflexibility. Let’s say your marketing manager thinks that adding social networking data might be useful. You could do it but now you’d have to change your schema to add this extra data, this new “relationship,” as shown below.

Fig 1b

The marketing manager is hypothesis-testing: she might use this data for a while and then realize that adding geographic information would be useful too since a marathon is going to take place in this town. This way, she can figure out which folks and their friends who live in proximity to each other and have purchased running shoes in the past. Adding this new information to the database would require yet another change to the schema. Anytime you decide to explore any additional “relationships,” you need to modify the schema. Changing a schema on a production system is a big deal; it can take several weeks and not something that database admins take lightly. There are better ways of analyzing data without modifying your database, but the point I’m trying to illustrate here is that tables are by design compact; while great for certain things, discovering patterns is not one of those things…

Hence the need and emergence of an alternative data representation format: graphs. Basically, a graph is made up of nodes that are connected to other nodes via an edge. In a graph database, the edge represents the relationship between the data entities represented by the nodes. For simplicity, let’s start with just two nodes, with one node representing an individual’s name and another for his address. This is represented in text as what is known as a triple. Much like how a sentence in English is constructed with a subject, an object, and predicate relating the two, a graph is also constructed using sentences (if you will) known as triples, as shown in Fig 2a. Visually, it’s more convenient to display this as shown in Fig 2b.

Name, lives at, Address

Fig 2a

Once you’ve grasped the simplicity of the model, it’s easy to get expressive with your data. Let’s look how we might represent the above tabular data in a graph database.

Adnan, lives at, 123 Main St
Adnan, bought, Running Shoes
Adnan, Paid, $120
Adnan, gender, male
And so on…

And graphically:

Fig 2b

What does putting data in this format afford us? Now let’s go back to our example of adding social media data as we did in the previous case. Since everything is already expressed as a relationship, adding new relationships is trivial. So adding friend information is as simple as adding that relationship between the nodes:

Adnan, friends with, John

John, friends with, Emma

And again, graphically, this would be represented as in Fig 3, with the new relationships in place:

Fig 3

The solid red lines in Fig 3 are the explicit relationships that were specified. And you can just as easily add more relationships as you need. Just as tabular representations excel at being spatially efficient, graph databases are efficient when it comes to adding and exploring new relationships.

I skipped an important step so that I could show you how a graph database works. The nodes don’t have any meaning as such, but what if you could ascribe meaning to each node that distinguishes the nodes of the same class? All names would belong to a class of type NAME, addresses could belong to a class of type LOCATION, and so on. Essentially, this is defining the ontology of the data, or in other words, the data dictionary.

The beauty of ontologies is that they allow you to assign properties to certain node classes and any node belonging to that class automatically inherits that property whether it is explicitly specified or not. For example, you could define an ontology where classes of type NAME who buy running shoes are also of class RUNNERS. Class of Type RUNNERS buy running apparel, so now you can infer that both Adnan and Emma might be interested in running apparel but not John. Inferencing is one more tool that can help enrich your data by introducing additional relationships. In Fig 3 above, even though I specified just two friendship relationships, I added twice as many new edges (shown dotted). From a marketer’s perspective, if Adnan is John’s friend, then the reverse is also true. So with friendships at least, you can infer that the reverse relationship is also true but that need not always be the case.

Ontologies are a complete topic unto themselves, but once properly defined, they are a very powerful concept and give graph analytics more flexibility, especially when it comes to constructing complex queries. Ontologies can be hierarchical (e.g. the LOCATION type can be expanded to include city, state and country, or to include latitude and longitude). Different ontologies may be needed based on how much you want to drill down into your data. And depending on what data you represent, you may want a different ontology. However, several generic ontologies specific to fields like biology, financial services, etc. have been produced that can be easily modified.

I’d like to add one last item before I wind up this post. Just as many of you may be familiar with the Structured Query Language or SQL for querying relational databases, certain graph databases employ a similar query language named SPARQL1. For those of you familiar with SQL, it is simple and elegant. However, writing a very complex query on a very large dataset can be trying since you have to be familiar with the data layout and other aspects of the database if the performance is to be acceptable. Since data layout is not an issue with graph databases, SPARQL doesn’t suffer from many of the shortcomings of SQL, and offers more expressivity and ease when it comes to writing complex queries.

For the interested reader, a thorough discussion on both Ontologies and SPARQL is available in Dean Allemang’s excellent text The Semantic Web for the Working Ontologist2. Dean also happens to be a guest blogger for us, by the way, and we’re really proud to have him sharing his thoughts with our readers.

This has been a more technical post than I’d normally prefer but understanding some of the fundamental advantages of a graph databases are crucial to why graph analytics is a powerful and different platform. Rather than just taking my word for it, I’ve tried to underscore why this is so.

To reiterate, here are the three takeaways from this post:

  • Relational databases are optimized for a compact representation, and do that very efficiently.
  • Graph databases on the other hand, are very flexible when it comes to adding new relationships
  • Graph database are further enhanced by facilities like ontologies, inferencing, and more expressive query languages like SPARQL.

This post focused on why a graph database makes it easy to add new relationships. Now if you’re thinking ahead, you’re probably wondering about the possibilities once your data is organized as a graph. It’s one thing to explicitly add new relationships you knew existed, but what about finding new relationships in your data that you weren’t aware of? Now, that’s what we call Discovery and more about that in my next blog. Until then, keep the comments and feedback coming.

 

Equality and Inequality in SPARQL

RpbIn previous blog posts I’ve touched a little on equality and inequality in SPARQL; in this post I’m going to look at some of the more confusing aspects of these (and SPARQL expression semantics in general) that can surprise even seasoned SPARQL developers like myself sometimes.

Previously, I introduced the fact that the equality operators in SPARQL (= and !=) represents value equality and inequality. This means that non-identical RDF terms can be considered equal/non-equal if they represent the same value. Consider the following example:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer = "01"^^xsd:integer AS ?equals) WHERE { }

The two RDF terms are not the same term, but they both represent the value 1 so the expression evaluates to give the value true:

----------
| equals |
=========
|  true  |
----------

Conversely, the following returns false:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer != "01"^^xsd:integer AS ?notEquals) WHERE { }
----------
| equals |
=========
| false  |
----------

Malformed Data

Where things start to get tricky is once you start considering data that is not well-formed. Consider the following:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer = "xyz"^^xsd:integer AS ?equals) WHERE { }

The first term is a valid integer but the second term is not, so what do you think the answer should be here?

I bet that most of you are thinking false, right?

Strangely enough you’d be wrong; this actually gives us an unbound value, which probably sets a few heads spinning.

----------
| equals |
=========
|        |
----------

The Operator Mapping

This behavior is due to something in the SPARQL specification which most everyday SPARQL users likely never worry about – the Operator Mapping [1]. What is happening is that the SPARQL engine is selecting different implementations of the operators to use based on the arguments provided to them.

In our first two examples, both arguments are valid numeric terms so the SPARQL implementation will use numeric equality/inequality; hence, we get true/false as appropriate.  In the third example, however, because one argument is an invalid numeric term the implementation cannot select numeric equality and instead selects RDF term equality. This returns true if, and only if, the terms are identical; otherwise, it produces a type error, it is this type error which results in no value for our third query.

SPARQL specifies that an implementation chooses the most appropriate operator implementation available and allows implementations to extend the Operator Mapping with additional datatypes if they so desire. This is why in some implementations you could legally get different results if your implementation understood how to turn “xyz”^^xsd:integer into a valid integer.

Type Errors?

Right now, you are probably wondering what exactly a type error is. Essentially, this is how SPARQL accounts for the fact that data is often messy. Often RDF data is generated or extracted from existing real world data sources that themselves contain errors and inconsistencies; this results in RDF data that often contains those same errors and inconsistencies.

Any expression in SPARQL expects its arguments to conform to certain datatypes; if they do not, then that expression will produce a type error. Type errors propagate up expression trees and may have different effects depending on the type of expression being used.

This is particularly relevant for operators like != because typically SPARQL defines them to be logical negation of their = equivalent, as logical negation of an error is still an error. The best way to see this is to try the != version of our third query:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer != "xyz"^^xsd:integer AS ?equals) WHERE { }

This, like the previous query, still returns an unbound value for this reason:

----------
| equals |
=========
|        |
----------

FILTER and Type Errors

Where Operator Mapping and type errors really start to confuse people is in how these interact with FILTER, because FILTER is required to just provide a true/false to a SPARQL engine as to whether it should include/exclude a possible solution. Therefore, FILTER is defined such that a type error is treated as equivalent to a false for the purpose of filtering.

This can lead to some apparent inconsistencies if you are expecting = and != to be reflexive as they would be in most programming languages. For example, let’s turn one of our earlier queries into an ASK, like so:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ASK { FILTER("1"^^xsd:integer = "xyz"^^xsd:integer) }

This gives us false, which if you’ve followed the explanations up to now you should be expecting. The expression produces a type error, which gets treated as false by FILTER.

So when we use != instead, we will still get false as our response:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ASK { FILTER("1"^^xsd:integer != "xyz"^^xsd:integer) }

If you weren’t aware of the type error semantics of expression evaluation in SPARQL, then seeing these two queries producing these results is going to make you scratch your head in puzzlement.

Conclusion

Expression semantics in SPARQL, particularly around things that people are used to from other languages like = and !=, can often be confusing to new users and even befuddle experienced SPARQL developers like myself. As long as you are aware of type errors, you should usually be able to understand what is occurring. Remember that if you are using FILTER, type errors are treated the same as false. If you are having trouble understanding what is going on, it can be useful to use a project expression or a BIND instead to see whether a value is being generated or not. If there is no value generated, then you know you got a type error.

Acknowledgements

Big thanks to Paul Gearon and Andy Seaborne for helping me get some of the thornier parts of this straight in my own head so that I could write this blog post.  The example results given in this post are the result of running the queries at http://sparql.org.

[1] http://www.w3.org/TR/sparql11-query/#OperatorMapping