Equality and Inequality in SPARQL

RpbIn previous blog posts I’ve touched a little on equality and inequality in SPARQL; in this post I’m going to look at some of the more confusing aspects of these (and SPARQL expression semantics in general) that can surprise even seasoned SPARQL developers like myself sometimes.

Previously, I introduced the fact that the equality operators in SPARQL (= and !=) represents value equality and inequality. This means that non-identical RDF terms can be considered equal/non-equal if they represent the same value. Consider the following example:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer = "01"^^xsd:integer AS ?equals) WHERE { }

The two RDF terms are not the same term, but they both represent the value 1 so the expression evaluates to give the value true:

----------
| equals |
=========
|  true  |
----------

Conversely, the following returns false:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer != "01"^^xsd:integer AS ?notEquals) WHERE { }
----------
| equals |
=========
| false  |
----------

Malformed Data

Where things start to get tricky is once you start considering data that is not well-formed. Consider the following:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer = "xyz"^^xsd:integer AS ?equals) WHERE { }

The first term is a valid integer but the second term is not, so what do you think the answer should be here?

I bet that most of you are thinking false, right?

Strangely enough you’d be wrong; this actually gives us an unbound value, which probably sets a few heads spinning.

----------
| equals |
=========
|        |
----------

The Operator Mapping

This behavior is due to something in the SPARQL specification which most everyday SPARQL users likely never worry about – the Operator Mapping [1]. What is happening is that the SPARQL engine is selecting different implementations of the operators to use based on the arguments provided to them.

In our first two examples, both arguments are valid numeric terms so the SPARQL implementation will use numeric equality/inequality; hence, we get true/false as appropriate.  In the third example, however, because one argument is an invalid numeric term the implementation cannot select numeric equality and instead selects RDF term equality. This returns true if, and only if, the terms are identical; otherwise, it produces a type error, it is this type error which results in no value for our third query.

SPARQL specifies that an implementation chooses the most appropriate operator implementation available and allows implementations to extend the Operator Mapping with additional datatypes if they so desire. This is why in some implementations you could legally get different results if your implementation understood how to turn “xyz”^^xsd:integer into a valid integer.

Type Errors?

Right now, you are probably wondering what exactly a type error is. Essentially, this is how SPARQL accounts for the fact that data is often messy. Often RDF data is generated or extracted from existing real world data sources that themselves contain errors and inconsistencies; this results in RDF data that often contains those same errors and inconsistencies.

Any expression in SPARQL expects its arguments to conform to certain datatypes; if they do not, then that expression will produce a type error. Type errors propagate up expression trees and may have different effects depending on the type of expression being used.

This is particularly relevant for operators like != because typically SPARQL defines them to be logical negation of their = equivalent, as logical negation of an error is still an error. The best way to see this is to try the != version of our third query:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ("1"^^xsd:integer != "xyz"^^xsd:integer AS ?equals) WHERE { }

This, like the previous query, still returns an unbound value for this reason:

----------
| equals |
=========
|        |
----------

FILTER and Type Errors

Where Operator Mapping and type errors really start to confuse people is in how these interact with FILTER, because FILTER is required to just provide a true/false to a SPARQL engine as to whether it should include/exclude a possible solution. Therefore, FILTER is defined such that a type error is treated as equivalent to a false for the purpose of filtering.

This can lead to some apparent inconsistencies if you are expecting = and != to be reflexive as they would be in most programming languages. For example, let’s turn one of our earlier queries into an ASK, like so:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ASK { FILTER("1"^^xsd:integer = "xyz"^^xsd:integer) }

This gives us false, which if you’ve followed the explanations up to now you should be expecting. The expression produces a type error, which gets treated as false by FILTER.

So when we use != instead, we will still get false as our response:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
ASK { FILTER("1"^^xsd:integer != "xyz"^^xsd:integer) }

If you weren’t aware of the type error semantics of expression evaluation in SPARQL, then seeing these two queries producing these results is going to make you scratch your head in puzzlement.

Conclusion

Expression semantics in SPARQL, particularly around things that people are used to from other languages like = and !=, can often be confusing to new users and even befuddle experienced SPARQL developers like myself. As long as you are aware of type errors, you should usually be able to understand what is occurring. Remember that if you are using FILTER, type errors are treated the same as false. If you are having trouble understanding what is going on, it can be useful to use a project expression or a BIND instead to see whether a value is being generated or not. If there is no value generated, then you know you got a type error.

Acknowledgements

Big thanks to Paul Gearon and Andy Seaborne for helping me get some of the thornier parts of this straight in my own head so that I could write this blog post.  The example results given in this post are the result of running the queries at http://sparql.org.

[1] http://www.w3.org/TR/sparql11-query/#OperatorMapping

Tuning SPARQL Queries for Performance

RpbThere is often a perception that RDF/SPARQL based systems are not performant compared to traditional RDBMS based systems. Personally, I think this is a spurious and inaccurate portrayal of the technology for a couple of reasons:

1. Often people try to compare performance of the two styles of system for the same task; this is essentially making an apples-to-oranges comparison. The two systems are suited to answering very different styles of questions. Do you compare your document database or key-value store directly to a RDBMS? So why do people try and do the same for RDF/SPARQL?

2. RDF/SPARQL systems are by definition less mature; RDBMS have had 40 years of R&D to get to where they are today while the oldest RDF/SPARQL systems have around 10 years of R&D.

With regards to #2, I should make it clear that less mature doesn’t mean less performant. There are RDF/SPARQL systems out there today (both free and commercial) which deliver excellent performance on SPARQL.

However, one of the problems I frequently see with SPARQL performance is that because the technology is relatively new, users don’t necessarily understand how to write good queries in the same way that an experienced SQL developer does. In lots of cases this doesn’t matter because SPARQL optimisers and engines tend to be far better at optimizing whatever the user throws at them relatively well, but there are a few cases where users could potentially make their queries run much faster, and I’d like to share some of these.

Value Equality vs. Term Equality

Now a lot of users of SPARQL probably don’t even know what I mean by the above, but I would give you good odds that most people are using the former without realizing it or knowing that the latter could actually be more performant in many scenarios.

So to illustrate this, let’s show you the difference in syntax. First, let’s see value equality:

FILTER(?x = 1)

And now let’s look at the corresponding term equality form:

FILTER(SAMETERM(?x, 1))

The syntactic difference is obvious –value equality uses the = operator while term equality uses the SAMETERM function.  The latter is likely more performant but it does have a drawback in that it is only term equality. Term equality only returns true if the RDF terms are identical. So if the RDF term in the database was encoded as “001″^^xsd:integer term equality would give false whereas value equality would return true because the value of the terms is equivalent.

As a general rule, use term equality wherever the value of the term is not an issue, such as when you are matching URIs or non-value typed literals (e.g. plain literals, xsd:string, literals with language tags). Also if you know the RDF terms in your database are consistently encoded (or your database does this for you), then always use term equality unless value equality is more appropriate for your query.

Note that many SPARQL engines will be clever enough to turn value equality into term equality when possible, but not all will, so sometimes if you have a slow running query when using value equality, try switching to term equality and see if that improves things.

Using overly broad FILTERs or FILTERs which can be changed into triple patterns

Sometimes I see and cringe at queries like the following:

SELECT *
WHERE
{
?s ?p ?o .
FILTER(?o = <http://example.org/Class>)
}

This filter is both overly broad in that it requires the SPARQL engine to apply it over a large swathe of data which is in of itself a bad idea, but it can also be trivially rewritten as a simple triple pattern like so:

SELECT *
WHERE
{
?s ?p <http://example.org/Class>
}

Note that many SPARQL engines actually do a rewrite of the query similar to this when they see such queries, but this rewrite is not always possible depending on the RDF term and the complexity of the portion of the query that the FILTER applies over. If you find yourself writing FILTERs like this, consider using the latter form. This can also apply even when you have a more complex filter like so:

SELECT *
WHERE
{
?s ?p ?o
FILTER (?o = <http://example.org/Class> || ?o = <http://example.org/OtherClass>)
}

This query could be better rewritten as a UNION like so:

SELECT *
WHERE
{
{ ?s ?p <http://example.org/Class> } UNION { ?s ?p <http://example.org/OtherClass> }
}

Again, this will likely be much more performant because it creates a lot less work for the query engine than the FILTER form of the query.

Avoid SELECT *

A lot of users blindly use SELECT * for all their queries which can hurt query performance for a couple of reasons (even I tend to do this particularly when writing examples like earlier in this post). Firstly, this means that the database has to transfer more data back to you, so you get your results slower. More importantly, if you only select the variables you are actually interested in, the optimizer and engine may be able to evaluate your query faster because it doesn’t need to keep around all the variables in each solution for the whole duration of query evaluation.

Avoid DISTINCT

DISTINCT is often a very costly operation for a system and depending on implementation may require the full materialization of all results before an engine can start returning results. Unless you really need it, try to avoid DISTINCT; if you aren’t sure whether your query needs DISTINCT, you may want to try using REDUCED instead.

REDUCED allows the engine to choose whether to remove duplicate results. If you still get duplicates when using REDUCED (which may vary by implementation), then you can apply DISTINCT instead to force duplicates to be removed at some cost to performance.

Use LIMIT

This one is somewhat obvious: only ask for as many results as you actually need by using LIMIT.  Many systems calculate results in a streaming fashion anyway, so if you ask for fewer results, then they have less work to do and can complete your query faster.

Extending SPARQL with CONSTRUCT Sub-queries

RobRecently myself and some of my colleagues were discussing something that we consider a key limitation of SPARQL right now which is that is provides no direct mechanism to create a temporary graph from the existing data to use in your query. Yes, you can use SPARQL update to INSERT a new graph based on existing data but we shouldn’t need to create a new graph just to answer the odd query, and if we do this and forget to remove it when we’re done we can cause other users’ queries to return unexpected results because we’ve modified the database.

To clarify what I mean, consider the following use case. We have a RDF database consisting of multiple social graphs where each has been imported from a different source and so each uses a different predicate for representing the friend relation. Most of the time we are just asking questions of a specific social graph but sometimes we want to ask a question over some subset of those graphs, this leads to unwieldy queries like so:

PREFIX foaf:
PREFIX fb:
PREFIX twitter:

SELECT *
WHERE
{
  { ?x foaf:knows ?y }
  UNION
  { ?x fb:friend ?y }
  UNION
  { ?x twitter:follows ?y }
}

If we want to get more information out like the names, email addresses etc. of each of those people and each is in a different vocal in its own graph, then your query quickly becomes unmanageable to write, especially if you start adding more complex query elements like FILTER and OPTIONAL into the mix.

Now some of this can be solved by normalizing your data into a single vocabulary when you ingest it but this isn’t always ideal depending on your use case. To try and address these kinds of problems where you want that normalized temporary graph for some of your queries, we propose an extension to SPARQL that allows the use of CONSTRUCT as a sub-query.

Syntax-wise this would look something like the following:

PREFIX foaf:
PREFIX fb:
PREFIX twitter:

SELECT *
FROM
CONSTRUCT { ?x foaf:knows ?y }
WHERE
{
  { ?x foaf:knows ?y }
  UNION
  { ?x fb:friend ?y }
  UNION
  { ?x twitter:follows ?y }
}
WHERE
{
  ?x foaf:knows ?y
}

For a simple query like this it actually makes the query more verbose, but with more complex queries this syntax could actually significantly reduce the verbosity and make the WHERE clause of the query – i.e. the question being asked – much more readable.

Essentially our proposal is that you allow the use of a CONSTRUCT as an additional form of dataset description which gives us two new syntactic forms:

• FROM CONSTRUCT
• FROM NAMED CONSTRUCT AS

Similarly to a sub-select our proposal is to define the grammar for a sub-construct so that it cannot specify any FROM clauses; i.e. we prevent nesting of sub-constructs.

The proposed interpretation of this is fairly simple. Any number of FROM CONSTRUCT may be used to construct temporary graphs which are merged together to form the default graph of the query and may be combined with normal FROMclauses. FROM NAMED CONSTRUCT is slightly more complex, firstly theused to name the temporary graph may not be referred to by any other FROM clause which stops people forcing indirect nesting of these. Secondly if thegiven for the graph name is the same as the name of any graph in the query services default dataset the temporary graph hides that graph for the course of the query. Once these or any other FROM/FROM NAMED clauses are interpreted to create the dataset the query processing proceeds as normal.

What do people think, would this be a useful extension to SPARQL?