In the first post of this series we looked at what iterative algorithms are. In this post we’ll look at the actual SPARQL queries that implement the peer-pressure clustering algorithm.
The algorithm starts with a set of vertices and edges between pairs of them. The code, simply stated, is:
assign each vertex to an initial cluster
do {
assign each vertex to the most popular cluster of its neighbors
} while (enough vertices changed clusters in this iteration)
Each statement results in a query. We chose to put each vertex in an initial cluster named the same as the vertex. The SPARQL query for this looks like:
[ 1] DROP GRAPH <urn:ga/g/xjz0>
[ 2] CREATE GRAPH <urn:ga/g/xjz0>
[ 3] INSERT {
[ 4] GRAPH <urn:ga/g/xjz0> {?s <urn:ga/p/inCluster> ?s }
[ 5] }
[ 6] WHERE {
[ 7] SELECT DISTINCT ?s WHERE {
[ 8] ?s <urn:ga/p/hasLink> ?o .
[ 9] }
[10]}
The cluster assignments are extra data that we’re adding to the database. To avoid cluttering up the default graph and make it easy to find them, we put them in named graphs chosen for this execution of the algorithm (“xjz” in the examples), so their names do not collide with other executions of the algorithm. In lines 1 and 2, we delete via the DROP construct any graph of that name and CREATE a new empty graph. The SELECT clause on lines 7-9 finds all vertices in the default graph that are the subject of a hasLink predicate and, for each unique such vertex, then on lines 4-6 INSERTs into the named graph a new triple of the same subject, the inCluster predicate, and the subject name as the cluster name.
We have used URIs starting with the urn: prefix so we can freely make up a namespace with no concern of collisions with other usage of the same strings elsewhere. These URIs will not be dereferenceable, which in this situation is not a problem. In <urn:ga/p/hasLink> and <urn:ga/g/xjzi> the “ga” stands for graph analytics, the “p” for predicate, and the “g” for graph name.
Typically the second query and the third query will execute repeatedly. The second query calculates new cluster assignments based on current cluster assignments.
[11] DROP GRAPH <urn:ga/g/xjzi+1>
[12] CREATE GRAPH <urn:ga/g/xjzi+1>
[13] INSERT
[14] {
[15] GRAPH <urn:ga/g/xjzi+1> { ?s <urn:ga/p/inCluster> ?clus3 }
[16] }
[17] WHERE {
[18] { SELECT ?s (SAMPLE(?clus) AS ?clus3)
[19] {
[20] { SELECT ?s (MAX(?clusCt) AS ?maxClusCt)
[21] {
[22] SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)
[23] WHERE
[24] {
[25] ?s <urn:ga/p/hasLink> ?o .
[26] GRAPH <urn:ga/g/xjzi> { ?o <urn:ga/p/inCluster> ?clus }
[27] } GROUP BY ?s ?clus
[28] } GROUP BY ?s
[29] }
[30] { SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)
[31] WHERE
[32] {
[33] ?s <urn:ga/p/hasLink> ?o .
[34] GRAPH <urn:ga/g/xjzi> { ?o <urn:ga/p/inCluster> ?clus }
[35] } GROUP BY ?s ?clus
[36] } FILTER (?clusCt = ?maxClusCt)
[37] } GROUP BY ?s
[38] }
[39] }
Lines 22-27, for each vertex, returns the vertex, the cluster assignments of its neighbors, and the count of vertices in that cluster. Lines 20-29 calculate the maximum cardinality of the clusters of the neighbors of each vertex. Lines 30-36 calculate the cluster assignment of that maximum-cardinality cluster. (SPARQL lacks a construct that returns the maximum value of one intermediate result and the corresponding element of another intermediate result.) Lines 18-38 join the maximum cardinality with the cluster name and also, in the case of a tie in maximum cardinality, break any tie by SAMPLEing a cluster assignment for each cluster. Lines 14-16 INSERT the new cluster-assignment triples into the named graph.
Because this query reads and writes to graphs whose names change with the iteration count, it is simple to create the queries in a scripting language (JavaScript in this case) to make these modifications to the query for each iteration. We have chosen a naming scheme of <graphName>i, where i varies across a range, to simplify debugging by examination of intermediate cluster assignments. An alternative (that is more memory conserving) would use just two named graphs (e.g., current and new) and MOVE new to current at the end of each iteration. The algorithm does clean up all but the named graph with the final cluster assignments, so the extra memory use is only transient.
[40] SELECT (COUNT(?oNew) as ?vccCt)
[41] WHERE {
[42] GRAPH <urn:ga/g/xjzi> {?s <urn:ga/p/inCluster> ?oOld}
[43] GRAPH <urn:ga/g/xjzi+1> {?s <urn:ga/p/inCluster> ?oNew}
[44] FILTER (?oOld != ?oNew)
[45] }
The second query executed in each iteration (lines 40-45) counts the number of vertices that changed cluster assignment in the just-completed iteration. The JavaScript loop that makes the queries then decides whether the algorithm has converged, either by the absolute number or percentage of vertices that changed or by a maximum iteration count.
I have omitted the text of a trivial query done at initialization time to count the number of vertices to be clustered.
Careful readers may note that the inner SELECTs at lines 22-27 and 30-35 are identical. While developing a complex nested query like this, needing to keep the same code in two spots identical is cumbersome. SPARQL 1.1 contains no good mechanism to define this code once and reuse it, like a function in a procedural language. The SQL WITH clause defines by name such a code block that can be executed wherever its results are needed.
The second and third queries above both have minor changes from one instance to the next (e.g., substituting “xjz2”, “xjz3”, … into the graph name). While these are not hard to cope within JavaScript code that creates the queries, it does mean that the query is literally different each time that it is executed. Hence, the SPARQL endpoint will have to reinterpret the query each time, which could at some point become time-consuming. SQL’s placeholder capability enables the passing of a value (of a given type) at execution time that is inserted at the placeholder’s position in the query, avoiding reinterpretation.
In this post we have looked at the SPARQL queries that implement peer-pressure clustering as an iterative algorithm. Next time we’ll look at the JavaScript code that creates the queries and calls the SPARQL endpoint.

In previous blog posts I’ve touched a little on equality and inequality in SPARQL; in this post I’m going to look at some of the more confusing aspects of these (and SPARQL expression semantics in general) that can surprise even seasoned SPARQL developers like myself sometimes.
Big Data is all the rage, especially since it received a new shot in the arm by a recent Gartner report. Everyone is interested in it, and Gartner [1] has thrown its hat into the ring for a definition of it. But even so, many people use these words in different ways. What is Big Data, and why all the interest in this topic now?