Writing Iterative Algorithms in SPARQL: Pt. 2

SteveIn the first post of this series we looked at what iterative algorithms are. In this post we’ll look at the actual SPARQL queries that implement the peer-pressure clustering algorithm.

The algorithm starts with a set of vertices and edges between pairs of them. The code, simply stated, is:

assign each vertex to an initial cluster
do {
	assign each vertex to the most popular cluster of its neighbors
} while (enough vertices changed clusters in this iteration)

Each statement results in a query. We chose to put each vertex in an initial cluster named the same as the vertex. The SPARQL query for this looks like:

[ 1] DROP GRAPH <urn:ga/g/xjz0>
[ 2] CREATE GRAPH <urn:ga/g/xjz0>
[ 3] INSERT {
[ 4]       GRAPH <urn:ga/g/xjz0> {?s <urn:ga/p/inCluster> ?s }
[ 5] }
[ 6] WHERE {
[ 7]    SELECT DISTINCT ?s WHERE {
[ 8]      ?s <urn:ga/p/hasLink> ?o .
[ 9]    }
[10]}

The cluster assignments are extra data that we’re adding to the database.  To avoid cluttering up the default graph and make it easy to find them, we put them in named graphs chosen for this execution of the algorithm (“xjz” in the examples), so their names do not collide with other executions of the algorithm.  In lines 1 and 2, we delete via the DROP construct any graph of that name and CREATE a new empty graph.  The SELECT clause on lines 7-9 finds all vertices in the default graph that are the subject of a hasLink predicate and, for each unique such vertex, then on lines 4-6 INSERTs into the named graph a new triple of the same subject, the inCluster predicate, and the subject name as the cluster name.

We have used URIs starting with the urn: prefix so we can freely make up a namespace with no concern of collisions with other usage of the same strings elsewhere.  These URIs will not be dereferenceable, which in this situation is not a problem.  In <urn:ga/p/hasLink> and <urn:ga/g/xjzi>  the “ga” stands  for graph analytics, the “p” for predicate, and the “g” for graph name.

Typically the second query and the third query will execute repeatedly.  The second query calculates new cluster assignments based on current cluster assignments.

[11] DROP GRAPH <urn:ga/g/xjzi+1>
[12] CREATE GRAPH <urn:ga/g/xjzi+1>
[13] INSERT
[14] {
[15]   GRAPH <urn:ga/g/xjzi+1>  { ?s <urn:ga/p/inCluster> ?clus3 }
[16] }
[17] WHERE {
[18]   { SELECT ?s (SAMPLE(?clus) AS ?clus3)
[19]   {
[20]     { SELECT ?s (MAX(?clusCt) AS ?maxClusCt)
[21]       {
[22]         SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)
[23]         WHERE
[24]         {
[25]           ?s <urn:ga/p/hasLink> ?o .
[26]           GRAPH <urn:ga/g/xjzi> { ?o <urn:ga/p/inCluster> ?clus }
[27]         } GROUP BY ?s ?clus
[28]       } GROUP BY ?s
[29]     }
[30]     { SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)
[31]       WHERE
[32]       {
[33]         ?s <urn:ga/p/hasLink> ?o .
[34]         GRAPH <urn:ga/g/xjzi> { ?o <urn:ga/p/inCluster> ?clus }
[35]       } GROUP BY ?s ?clus
[36]     } FILTER (?clusCt = ?maxClusCt)
[37]     } GROUP BY ?s
[38]   }
[39] }

Lines 22-27, for each vertex, returns the vertex, the cluster assignments of its neighbors, and the count of vertices in that cluster.  Lines 20-29 calculate the maximum cardinality of the clusters of the neighbors of each vertex.  Lines 30-36 calculate the cluster assignment of that maximum-cardinality cluster.  (SPARQL lacks a construct that returns the maximum value of one intermediate result and the corresponding element of another intermediate result.)  Lines 18-38 join the maximum cardinality with the cluster name and also, in the case of a tie in maximum cardinality, break any tie by SAMPLEing a cluster assignment for each cluster.  Lines 14-16 INSERT the new cluster-assignment triples into the named graph.

Because this query reads and writes to graphs whose names change with the iteration count, it is simple to create the queries in a scripting language (JavaScript in this case) to make these modifications to the query for each iteration.  We have chosen a naming scheme of <graphName>i, where i varies across a range, to simplify debugging by examination of intermediate cluster assignments.  An alternative (that is more memory conserving) would use just two named graphs (e.g., current and new) and MOVE new to current at the end of each iteration.  The algorithm does clean up all but the named graph with the final cluster assignments, so the extra memory use is only transient.

[40] SELECT (COUNT(?oNew) as ?vccCt)
[41] WHERE {
[42]   GRAPH <urn:ga/g/xjzi>   {?s <urn:ga/p/inCluster> ?oOld}
[43]   GRAPH <urn:ga/g/xjzi+1> {?s <urn:ga/p/inCluster> ?oNew}
[44]   FILTER (?oOld != ?oNew)
[45] }

The second query executed in each iteration (lines 40-45) counts the number of vertices that changed cluster assignment in the just-completed iteration.  The JavaScript loop that makes the queries then decides whether the algorithm has converged, either by the absolute number or percentage of vertices that changed or by a maximum iteration count.

I have omitted the text of a trivial query done at initialization time to count the number of vertices to be clustered.

Careful readers may note that the inner SELECTs at lines 22-27 and 30-35 are identical.  While developing a complex nested query like this, needing to keep the same code in two spots identical is cumbersome.  SPARQL 1.1 contains no good mechanism to define this code once and reuse it, like a function in a procedural language.  The SQL WITH clause defines by name such a code block that can be executed wherever its results are needed.

The second and third queries above both have minor changes from one instance to the next (e.g., substituting “xjz2”, “xjz3”, … into the graph name).  While these are not hard to cope within JavaScript code that creates the queries, it does mean that the query is literally different each time that it is executed. Hence, the SPARQL endpoint will have to reinterpret the query each time, which could at some point become time-consuming.  SQL’s placeholder capability enables the passing of a value (of a given type) at execution time that is inserted at the placeholder’s position in the query, avoiding reinterpretation.

In this post we have looked at the SPARQL queries that implement peer-pressure clustering as an iterative algorithm.  Next time we’ll look at the JavaScript code that creates the queries and calls the SPARQL endpoint.

Equality and Inequality in SPARQL

RpbIn previous blog posts I’ve touched a little on equality and inequality in SPARQL; in this post I’m going to look at some of the more confusing aspects of these (and SPARQL expression semantics in general) that can surprise even seasoned SPARQL developers like myself sometimes.

Previously, I introduced the fact that the equality operators in SPARQL (= and !=) represents value equality and inequality. This means that non-identical RDF terms can be considered equal/non-equal if they represent the same value. Consider the following example:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer = "01"^^xsd:integer AS ?equals) WHERE { }

The two RDF terms are not the same term, but they both represent the value 1 so the expression evaluates to give the value true:

----------
| equals |
=========
|  true  |
----------

Conversely, the following returns false:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer != "01"^^xsd:integer AS ?notEquals) WHERE { }
----------
| equals |
=========
| false  |
----------

Malformed Data

Where things start to get tricky is once you start considering data that is not well-formed. Consider the following:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer = "xyz"^^xsd:integer AS ?equals) WHERE { }

The first term is a valid integer but the second term is not, so what do you think the answer should be here?

I bet that most of you are thinking false, right?

Strangely enough you’d be wrong; this actually gives us an unbound value, which probably sets a few heads spinning.

----------
| equals |
=========
|        |
----------

The Operator Mapping

This behavior is due to something in the SPARQL specification which most everyday SPARQL users likely never worry about – the Operator Mapping [1]. What is happening is that the SPARQL engine is selecting different implementations of the operators to use based on the arguments provided to them.

In our first two examples, both arguments are valid numeric terms so the SPARQL implementation will use numeric equality/inequality; hence, we get true/false as appropriate.  In the third example, however, because one argument is an invalid numeric term the implementation cannot select numeric equality and instead selects RDF term equality. This returns true if, and only if, the terms are identical; otherwise, it produces a type error, it is this type error which results in no value for our third query.

SPARQL specifies that an implementation chooses the most appropriate operator implementation available and allows implementations to extend the Operator Mapping with additional datatypes if they so desire. This is why in some implementations you could legally get different results if your implementation understood how to turn “xyz”^^xsd:integer into a valid integer.

Type Errors?

Right now, you are probably wondering what exactly a type error is. Essentially, this is how SPARQL accounts for the fact that data is often messy. Often RDF data is generated or extracted from existing real world data sources that themselves contain errors and inconsistencies; this results in RDF data that often contains those same errors and inconsistencies.

Any expression in SPARQL expects its arguments to conform to certain datatypes; if they do not, then that expression will produce a type error. Type errors propagate up expression trees and may have different effects depending on the type of expression being used.

This is particularly relevant for operators like != because typically SPARQL defines them to be logical negation of their = equivalent, as logical negation of an error is still an error. The best way to see this is to try the != version of our third query:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer != "xyz"^^xsd:integer AS ?equals) WHERE { }

This, like the previous query, still returns an unbound value for this reason:

----------
| equals |
=========
|        |
----------

FILTER and Type Errors

Where Operator Mapping and type errors really start to confuse people is in how these interact with FILTER, because FILTER is required to just provide a true/false to a SPARQL engine as to whether it should include/exclude a possible solution. Therefore, FILTER is defined such that a type error is treated as equivalent to a false for the purpose of filtering.

This can lead to some apparent inconsistencies if you are expecting = and != to be reflexive as they would be in most programming languages. For example, let’s turn one of our earlier queries into an ASK, like so:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ASK { FILTER("1"^^xsd:integer = "xyz"^^xsd:integer) }

This gives us false, which if you’ve followed the explanations up to now you should be expecting. The expression produces a type error, which gets treated as false by FILTER.

So when we use != instead, we will still get false as our response:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ASK { FILTER("1"^^xsd:integer != "xyz"^^xsd:integer) }

If you weren’t aware of the type error semantics of expression evaluation in SPARQL, then seeing these two queries producing these results is going to make you scratch your head in puzzlement.

Conclusion

Expression semantics in SPARQL, particularly around things that people are used to from other languages like = and !=, can often be confusing to new users and even befuddle experienced SPARQL developers like myself. As long as you are aware of type errors, you should usually be able to understand what is occurring. Remember that if you are using FILTER, type errors are treated the same as false. If you are having trouble understanding what is going on, it can be useful to use a project expression or a BIND instead to see whether a value is being generated or not. If there is no value generated, then you know you got a type error.

Acknowledgements

Big thanks to Paul Gearon and Andy Seaborne for helping me get some of the thornier parts of this straight in my own head so that I could write this blog post.  The example results given in this post are the result of running the queries at http://sparql.org.

[1] http://www.w3.org/TR/sparql11-query/#OperatorMapping

Facing the Challenges of Big Data

YarcData would like to introduce our first guest blogger, Dean Allemang. We’re delighted to have a noted expert on the Semantic Web join us for a series of blog articles discussing how graph analytics is quickly gaining prominence in the enterprise for solving complex business analytics problems.

DeanBig Data is all the rage, especially since it received a new shot in the arm by a recent Gartner report.  Everyone is interested in it, and Gartner [1] has thrown its hat into the ring for a definition of it.  But even so, many people use these words in different ways. What is Big Data, and why all the interest in this topic now?

For many people, Big Data is synonymous with search optimization (for many people, all of data science is synonymous with search optimization).  For commercial websites, visitor behavior equals revenue.  Understanding their behavior is essential for growing a web-based business and staying competitive.  So we gather data about our visitors, their habits, and try to glean some understanding of what does and doesn’t work to turn visits into value.

But for others, Big Data comes from outside a single web site.  Social networking sites provide access to huge amounts of data about user behavior.  If we could access and utilize this data, we could understand users who haven’t even visited our web sites.

You might not know it, if you just read articles about SEO, but there actually is a world outside of web commerce, and it is a big world.  Actually, several big worlds: there is a world of Finance, of Science, of Manufacturing, and these worlds are also data intensive.  Detailed financial behavior has a huge impact on financial indicators, and hence provides a huge opportunity for data exploitation [2].

What do all these things have in common?  All of them are spaces that are dominated by Big Data.

While there are some specific technologies that are often associated with Big Data, Big Data itself isn’t a technology; it is a description of a set of challenges in today’s world that are faced in these and many other domains.  What are the recurring aspects of a Big Data challenge?

1) Large amounts of data.  All of the examples I mentioned above—web traffic, social networks, financial transactions, and scientific measurements—are producing data at a faster rate than has ever been imagined before.  This has resulted in an apparent explosion of available data.  This driving force has given Big Data its name.

2) Complexity of data, and of the questions we need to ask.  User behavior, market analysis, scientific conclusions: to gain insight into any of these things requires creativity and a deep understanding of the complex interplay of many factors.  Simple data analysis was sufficient in the olden days.  Now we need complex answers to complex questions to understand the world and be competitive.  In many cases, we need those answers fast – sometimes lightning fast. [2]

3) Heterogeneity of data.  In the case of gathering data from a single web site, the data might come from a single source.  But in most of the cases I mentioned before, data will come from many sources.  Social networks track behavior across multiple sites and applications.  Financial data includes market intelligence, as well as massive transaction data.  Contributions are being made to the world’s collection of scientific data from all over the world.

These challenges have caused us to move our thinking about how to manage data from the tried-and-true methods that have held sway in enterprises for the past 30 years.  During that time, data management was done largely within a single organization (mostly homogeneous) and the data structure was largely understood (going all the way back to the “Master Data Record” that many businesses had for a long time).

Ironically, Challenge #1 (large amounts of data) is the one from which the Big Data movement gets its name, but is also the only one that has been addressed systematically by data management technology for decades.  The scale of relational database systems has improved steadily over the years; the ability of these systems to handle complex and heterogeneous data has not been as much a focus of development.

Most new technologies that have been developed to address the Big Data challenges have begun from a vantage point of Challenge #1. From there, they have taken different approaches to addressing challenges #2 and #3.

The World Wide Web Consortium has come at this problem from another angle. Starting naturally enough (for the W3C) with distribution of data as their point of focus, they developed RDF, a framework for managing data resources distributed over the web, and SPARQL, a powerful query language for RDF.

RDF achieves its data distribution goals by representing data as a graph; SPARQL provides a powerful way to manage the complexity that results from combining distributed data sets. Many critics have doubted whether such an approach, driven primarily as it is by data distribution and complexity concerns, can be further developed to address the scale issues of Challenge #1.

Just because an RDF database focuses on the complexity and diversity issues of the Big Data challenge, doesn’t mean it can’t deal with large data sets as well.  Many RDF databases and SPARQL engines today are able to scale to very large sizes.

YarcData’s Urika™ technology is a good example. Urika™ directly addresses all three challenges of Big Data. As a RDF database, it excels in diversity and distribution of data.  Its highly parallelized architecture lets it excel at complex queries, achieving fast response times even for complex queries and for large data sets.

With the advent of high-performance RDF databases, these W3C technologies have become a key player in Big Data technology.

[1] http://www.gartner.com/newsroom/id/2359715
[2] http://www.thedailybeast.com/newsweek/2013/01/04/eunuchs-of-the-universe-tom-wolfe-on-wall-street-today.html

Dean Allemang, co-author of the bestselling book, Semantic Web for the Working Ontologist, is a consultant, thought-leader, and entrepreneur focusing on industrial applications of distributed data technology. He served nearly a decade as Chief Scientist at TopQuadrant, the world’s leading provider of Semantic Web development tools, producing enterprise solutions for a variety of industries. As part of his drive to see the Semantic Web become an industrial success, he is particularly interested in innovations that move forward the state of the art in distributed data technology.  Dean’s current work is concentrated on the life sciences and finance industries, where he currently sees the most promising industrial interest in this technology.