The Case of the “Forbidden Queries”

Arvind

I recently spoke to a YarcData customer who has a very effective data warehouse that has been built up over a number of years. As with most data warehouses, they have fine-tuned the data models and normalization of their data to optimize for the various queries across the KPIs, reports, and dashboards that their business users identified and prioritized. This data warehouse has been dimensionalized (is that a word?) across all the germane business parameters so that their business users can run queries for various analyses along any one of those parameters.

The problem they faced was that the business users kept coming up with queries that completely flew in the face of the organization, layout and dimensionalization of the data. These queries were initially called the “maverick queries” since they were completely off the wall and required join, after join, after join, after join. But every time someone ran one of these “maverick queries”, it brought the entire data warehouse to its knees and made it unusable for the large number of users who needed the existing reports and dashboards. Consequently these “maverick queries” became the “forbidden queries”, since no one was allowed to run them anymore!

The “maverick” users, being of a strong and determined disposition, started pulling extracts of the data warehouse into data marts to run their “forbidden queries”. This had its own challenges, since they were working off a subset of the data, and setting up and running these data marts took quite a bit of time. The worst part of the forbidden queries is that they are not only ad hoc but also highly dynamic: the queries were constantly changing, and by the time a data mart had been built for one set of questions, the business had moved on to another set of questions.

Enter Graph Analytics. By representing the same data in the data warehouse as a graph, the data is now queryable (I am pretty sure that's not a word!) along any dimension, any relationship. The approach is to load the YarcData graph analytics appliance every night from the main data warehouse and then run the “forbidden queries” on the YarcData appliance during the day, thereby freeing up the data warehouse to focus on the critical reports and dashboards. The users who wanted the traditional dashboards and reports were happy, since the data warehouse was always available for their business-critical operations. The “maverick” users who wanted the “forbidden queries” were ecstatic, since they could now run them. Some of the data might be a day old, but that is a huge improvement over not being able to run the queries at all.
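To make the "any dimension, any relationship" idea concrete, here is a minimal sketch in Python. It is only an illustration and says nothing about how the YarcData appliance works internally. The entity names, predicates and the `follow` helper are all hypothetical. The point is that once rows are expressed as (subject, predicate, object) triples, a question that would need a four-way join in the warehouse becomes a walk along edges, in either direction.

```python
from collections import defaultdict

# Hypothetical warehouse rows, already flattened into triples.
triples = [
    ("customer:c1", "locatedIn", "region:west"),
    ("customer:c2", "locatedIn", "region:east"),
    ("order:o1", "placedBy", "customer:c1"),
    ("order:o2", "placedBy", "customer:c2"),
    ("order:o1", "contains", "product:p1"),
    ("order:o2", "contains", "product:p2"),
    ("product:p1", "suppliedBy", "supplier:s1"),
    ("product:p2", "suppliedBy", "supplier:s2"),
]

def build_index(triples):
    """Index every edge in both directions so no direction is privileged."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for subject, predicate, obj in triples:
        forward[(subject, predicate)].add(obj)
        backward[(obj, predicate)].add(subject)
    return forward, backward

def follow(start, path, forward, backward):
    """Walk a list of (predicate, direction) steps starting from one node."""
    frontier = {start}
    for predicate, direction in path:
        index = forward if direction == "out" else backward
        frontier = {
            neighbour
            for node in frontier
            for neighbour in index.get((node, predicate), ())
        }
    return frontier

forward, backward = build_index(triples)

# "Which suppliers ultimately serve the west region?"
path = [
    ("locatedIn", "in"),    # region   <- customer
    ("placedBy", "in"),     # customer <- order
    ("contains", "out"),    # order    -> product
    ("suppliedBy", "out"),  # product  -> supplier
]
west_suppliers = follow("region:west", path, forward, backward)
print(west_suppliers)  # {'supplier:s1'}
```

Changing the question only means changing `path`; nothing about the stored data has to be reorganized first.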

While we have seen a lot of usage of Graph Analytics in traditional graph problems, this was an interesting, non-traditional use case for Graph Analytics – enabling existing data warehouses to handle their “forbidden queries”. What are your “forbidden queries”? How do you handle them?

Issues with Data Normalization in RDF

Tim

Back around 2004 to 2008, I worked on a project that took the approach of modeling data via ontologies using some proprietary methods. As I sought out more standard methods for data modeling, I found some of the methods that now make up today’s Semantic Web technology stack. I can recall many, many conversations on the topic of “fusion”, the term we used to describe determining that two data instances were equivalent (and the need to maximize fusion to search for linkages between instances). For example, “Osama” in one context might be considered exactly the same as “Usama” in another, or similar, context. The problem is daunting and represents one of the more difficult challenges facing the natural language processing field today. Those challenges are beyond the scope of today’s blog, but a very similar problem arises in RDF, and that is the topic for today.

As my interest in the Semantic Web grew from those early days, I began to see the problem reappear at the lowest levels of RDF. At the higher levels we use ontologies, reasoning techniques and description logics to help determine that two things are equal. But in RDF itself, we have the problem as well. Some of the problem is addressed in RDF and by SPARQL, but some is not. Today, as a practitioner, I want to broadly categorize the situations that arise when working with RDF, in the hope of spurring some discussion about them and learning more.

  • plain literals vs typed literals
  • literals with and without language tags
  • UTF normalization

Plain literals vs typed literals

In RDF, it is possible to represent a literal plainly (i.e., providing no datatype information) and also to type a literal using XML Schema datatypes. Consider:

"foo"
"foo"^^<http://www.w3.org/2001/XMLSchema#string>
"foo"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral>

In RDF 1.0 and SPARQL, with a few exceptions where there is freedom in the specs[6], a plain literal string is not equivalent to an xsd:string typed literal, and neither is equivalent to an rdf:PlainLiteral typed literal.

Consider also the following representations for the integer 1: "1"^^xsd:integer, "+0001"^^xsd:integer and "1"^^xsd:byte.  All are syntactically different but represent the same entity[2].

Consider, similarly, the issue with decimal numbers. "2"^^xsd:decimal and "2.0"^^xsd:decimal are distinct RDF terms, so again they will not match each other in a graph pattern, even though they denote the same value.
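The gap between "same term" and "same value" can be sketched in a few lines of Python. This is a toy model, not any triple store's implementation: a literal is represented as a hypothetical (lexical form, datatype) pair, `term_equal` mimics exact term matching, and `value_equal` parses the lexical form first, which is roughly what a value-aware comparison has to do. Only three numeric datatypes are handled, and range checks (e.g. for xsd:byte) are deliberately left out.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
NUMERIC_TYPES = {XSD + "integer", XSD + "byte", XSD + "decimal"}

def term_equal(a, b):
    """Term equality: lexical form and datatype must both match exactly."""
    return a == b

def numeric_value(literal):
    """Parse the lexical form of a supported numeric literal into a Decimal."""
    lexical, datatype = literal
    if datatype not in NUMERIC_TYPES:
        raise ValueError(f"not a supported numeric literal: {literal!r}")
    return Decimal(lexical)

def value_equal(a, b):
    """Value equality: compare what the literals denote, not how they are written."""
    return numeric_value(a) == numeric_value(b)

one = ("1", XSD + "integer")
padded_one = ("+0001", XSD + "integer")
byte_one = ("1", XSD + "byte")
two = ("2", XSD + "decimal")
two_point_zero = ("2.0", XSD + "decimal")
plain_foo = ("foo", None)  # None stands for "no datatype given"
string_foo = ("foo", XSD + "string")

print(term_equal(one, padded_one))       # False
print(value_equal(one, padded_one))      # True
print(value_equal(one, byte_one))        # True
print(term_equal(two, two_point_zero))   # False
print(value_equal(two, two_point_zero))  # True
print(term_equal(plain_foo, string_foo)) # False
```

A system that wants to maximize fusion would have to pick one canonical lexical form per value (or index by parsed value) so that all of these variants land on the same key.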

Literals with and without language tags

Language tags also introduce this problem and can further sabotage any effort to maximize the fusion of data literals. According to RDF, a plain literal is not the same as a literal carrying a language tag, which makes sense unless you are trying to maximize fusion. So, let’s consider the following:

  • "foo" and "foo"@en are not equal
  • "foo"@en, "foo"@En and "foo"@EN are equal (thank heavens)
  • "foo"@en and "foo"@eng are not equal (ISO 639-1 code versus ISO 639-2 code [4])
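The three cases above can be reproduced with a small Python sketch. Literals are modelled as hypothetical (text, language tag) pairs, `rdf_equal` mimics the case-insensitive tag comparison, and `fusion_key` shows one possible way a fusion-minded system might go further. The three-entry ISO 639-2 to ISO 639-1 table is a deliberately tiny assumption for illustration, and only simple tags without subtags (no "en-US") are considered.

```python
# Deliberately tiny, illustrative mapping; a real table would be far larger.
ISO_639_2_TO_1 = {"eng": "en", "fra": "fr", "deu": "de"}

def _lower(tag):
    return tag.lower() if tag is not None else None

def rdf_equal(a, b):
    """Same text and same language tag, with tags compared case-insensitively."""
    text_a, tag_a = a
    text_b, tag_b = b
    return text_a == text_b and _lower(tag_a) == _lower(tag_b)

def fusion_key(literal, drop_language=False):
    """Key under which a fusion-minded system might group literals.

    Tags are lowercased and three-letter codes are folded onto two-letter
    ones where the table knows them. `drop_language=True` ignores the tag
    entirely, which is more aggressive and can merge unrelated words.
    """
    text, tag = literal
    if tag is None or drop_language:
        return (text, None)
    tag = tag.lower()
    return (text, ISO_639_2_TO_1.get(tag, tag))

print(rdf_equal(("foo", None), ("foo", "en")))    # False
print(rdf_equal(("foo", "en"), ("foo", "EN")))    # True
print(rdf_equal(("foo", "en"), ("foo", "eng")))   # False
print(fusion_key(("foo", "eng")) == fusion_key(("foo", "En")))  # True
print(fusion_key(("foo", None)) == fusion_key(("foo", "en"), drop_language=True))  # True
```

Whether dropping the tag is acceptable depends entirely on the data: "chat" in English and "chat" in French are not the same thing.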

UTF Normalization

RDF data is represented in UTF-8 form, which allows for the encoding of multiple languages, a necessary condition for a data representation that intends to be the data language of the World Wide Web. But Unicode introduces some normalization issues of its own. To quote Wikipedia[5]:

For example, the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

But SPARQL does not abide by this (a decision most likely made to ease a potentially huge processing burden for triple store implementations), as we can see by running the following test query at sparql.org:

    ASK WHERE {
       ?s ?p ?o .
       FILTER( "\u006E\u0303" = "\u00F1" )
    }

The query returns the answer “no”.
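The same comparison is easy to reproduce outside of SPARQL. The Python sketch below uses the standard library's `unicodedata.normalize` to show that the two sequences differ code point by code point but become identical once both are brought to the same normalization form. NFC is used here as an assumption; which form a system should pick is a design decision.

```python
import unicodedata

decomposed = "\u006E\u0303"  # "n" followed by a combining tilde
precomposed = "\u00F1"       # the single code point for n with tilde

# Code point comparison, which is what the query above effectively does.
print(decomposed == precomposed)  # False
print(len(decomposed), len(precomposed))  # 2 1

# After normalizing both sides to NFC the strings are identical.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
nfc_precomposed = unicodedata.normalize("NFC", precomposed)
print(nfc_decomposed == nfc_precomposed)  # True

# NFD goes the other way and splits the precomposed character apart.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Applying one such normalization to every literal before it is stored, and to every literal in an incoming query, would make the two spellings match, at the cost of extra processing on both paths.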

Further Thoughts

As we can see, even at the lowest levels of data representation we can run into issues of equality. For a system looking to maximize query results, this becomes a problem that must be addressed. This leaves one to wonder:

  • How could we address this problem without normalizing data on ingest?
  • If we did choose to normalize data on ingest, what would such a process look like?
  • Are there any systems today that attempt to address this problem?
  • Are there any similar problems that exist solely in RDF or SPARQL other than those mentioned here?

In a future blog I hope to address some of these remaining questions.

[1] http://answers.semanticweb.com/questions/3832/plain-literals-vs-xsd-typed-literals
[2] http://answers.semanticweb.com/questions/9781/regarding-plain-literal-and-rdfplainliteral-equality
[3] http://richard.cyganiak.de/blog/2011/05/the-rdf-11-literal-quiz/
[4] http://www.loc.gov/standards/iso639-2/php/code_list.php
[5] http://en.wikipedia.org/wiki/Unicode_equivalence
[6] http://answers.semanticweb.com/questions/16864/matching-typeduntyped-literals-in-sparql-joins

Practical SPARQL Benchmarking

Rob

There is a certain amount of misguided belief in the market that Semantic Web technologies simply aren’t performant enough for the needs of a business, and I often hear this presented as a reason for choosing a traditional RDBMS or another technology, such as a NoSQL solution, over these technologies.

While there is some historical truth to these claims, since these technologies are still relatively new, there is now a slew of scalable and performant production-ready systems arriving on the market from both commercial vendors and open source projects, targeted at a variety of levels of scalability. We ourselves at YarcData are building the uRiKA graph appliance, which seriously pushes the boundaries of performance and scalability.

With an increasing number of products to choose from, how does a business decide on the appropriate product for its business problem?

Typically people evaluate their options based on the vendor-published benchmarks, but as I highlighted in my recent SemTechBiz [1] talk [2], there are some issues with this approach. Firstly, the standard benchmarks are all designed to benchmark stores in very different ways, which may bear little or no resemblance to how you will actually use the product to solve a business problem. Secondly, vendors can be somewhat less than transparent about their methodologies and test environments. Thirdly, most benchmarks focus purely on speed and throughput.

Often the user is not interested in how fast a system answers their query but rather in whether it can answer their query at all. From a user's perspective, a slower system that can answer a query is likely preferable to a faster system that fails on it. Ultimately a user must judge a system on whether it solves their business problem, not on some benchmark that bears no resemblance to their problem.

To try to address these problems, I presented a tool at SemTechBiz called SPARQL Query Benchmarker [3], which was developed internally here at YarcData for the purpose of running standardized, repeatable benchmarks for performance and regression testing. We found the tool so useful that we’ve made it open source and available to the community, as we’re hoping to promote more transparency and repeatability in benchmarking.

The key features of this tool are a command line interface and an API that allow you to run any set of queries against any SPARQL endpoint. This gives users the means to gauge whether a system performs on their data, with their queries, on their hardware, and allows them to make an informed decision about which system is performant enough to solve their problem.
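To illustrate the general idea of "your queries against your endpoint", here is a minimal Python sketch. It is not the SPARQL Query Benchmarker and does not reflect its API; the function names `benchmark` and `http_executor`, the result layout and the stand-in executor are all assumptions made for this example. It records failures alongside timings, since whether a query can be answered at all matters as much as how fast. The HTTP helper follows the standard SPARQL protocol of sending the query text in a `query` parameter.

```python
import statistics
import time
import urllib.parse
import urllib.request

def http_executor(endpoint_url, timeout=60):
    """Build a function that sends one query to a SPARQL endpoint over HTTP GET."""
    def run_query(query):
        url = endpoint_url + "?" + urllib.parse.urlencode({"query": query})
        request = urllib.request.Request(
            url, headers={"Accept": "application/sparql-results+json"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    return run_query

def benchmark(queries, run_query, runs=3):
    """Run each named query `runs` times and record successes, failures and mean time."""
    results = {}
    for name, query in queries.items():
        timings = []
        failures = 0
        for _ in range(runs):
            start = time.perf_counter()
            try:
                run_query(query)
            except Exception:
                failures += 1  # a failed run counts, but is not timed
                continue
            timings.append(time.perf_counter() - start)
        results[name] = {
            "successes": len(timings),
            "failures": failures,
            "mean_seconds": statistics.mean(timings) if timings else None,
        }
    return results

# Stand-in executor so the sketch runs without a live endpoint.
# With a real store, use http_executor("<your endpoint URL>") instead.
def fake_executor(query):
    if "unsupported" in query:
        raise RuntimeError("query failed")
    return b"{}"

queries = {
    "simple": "ASK WHERE { ?s ?p ?o }",
    "hard": "SELECT * WHERE { ?s ?p ?o } # unsupported",
}
report = benchmark(queries, fake_executor, runs=2)
for name, stats in report.items():
    print(name, stats)
```

Passing the executor in as a function keeps the timing loop independent of any one store or transport, so the same query set can be replayed against each candidate system.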

References

[1] http://semtechbizsf2012.semanticweb.com/
[2] Presentation slides: Practical SPARQL Benchmarking
[3] https://sourceforge.net/projects/sparql-query-bm/