Issues with Data Normalization in RDF

Back around 2004 to 2008, I worked on a project that modeled data via ontologies using some proprietary methods.  As I sought out more standard methods for data modeling, I found some of the new methods that now make up today’s Semantic Web technology stack.  I can recall many, many conversations on the topic of “fusion”, the term we used for determining that two data instances were equivalent (and the need to maximize fusion in order to search for linkages between instances).  For example, “Osama” in one context is considered exactly the same as “Usama” in another, or similar, context.  The problem is daunting and represents one of the more difficult challenges facing the natural language processing field today.  Those challenges are beyond the scope of today’s blog, but a quite similar problem arises in RDF, and that is the topic for today.

As my interest in the Semantic Web grew from those early days, I began to see the problem reappear at the lowest levels of RDF.  At the higher levels we use ontologies, reasoning techniques and description logics to help determine that two things are equal.  But in RDF itself, we have the problem as well.  Some of the problem is addressed by RDF and by SPARQL, but some is not.  Today, as a practitioner, I want to broadly categorize the situations that arise when working with RDF, in the hope of spurring some discussion about them and learning more:

  • plain literals vs typed literals
  • literals with and without language tags
  • UTF normalization

Plain literals vs typed literals

In RDF, it is possible to represent a literal plainly (i.e., with no datatype information) or to give it a datatype, such as one of the XML Schema datatypes.  Consider:

"foo"
"foo"^^<http://www.w3.org/2001/XMLSchema#string>
"foo"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral>

In RDF 1.0 and SPARQL, with a few exceptions where the specs leave some freedom [6], a plain literal is not equivalent to an xsd:string typed literal, and neither of them is equivalent to an rdf:PlainLiteral typed literal.
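To make the distinction concrete, here is a minimal Python sketch of term-by-term comparison. It is only an illustration, not how any particular triple store is implemented: the `Literal` class and the constant names are hypothetical, and a literal is modeled simply as a lexical form plus an optional datatype IRI.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_PLAIN_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#PlainLiteral"


@dataclass(frozen=True)
class Literal:
    """Hypothetical model of an RDF 1.0 literal (language tags left out)."""
    lexical: str
    datatype: Optional[str] = None  # None stands for a plain literal


plain = Literal("foo")
typed_string = Literal("foo", XSD_STRING)
typed_plain = Literal("foo", RDF_PLAIN_LITERAL)

# Dataclass equality compares every field, so the same lexical form
# with a different datatype is a different term.
all_distinct = len({plain, typed_string, typed_plain}) == 3
print(all_distinct)
```

All three literals share the lexical form "foo", yet none of them compares equal to another, which is exactly what gets in the way of fusion.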

Consider also the following representations for the integer 1: "1"^^xsd:integer, "+0001"^^xsd:integer and "1"^^xsd:byte.  All are syntactically different but represent the same entity[2].

Consider, similarly, the issue with decimal numbers: "2"^^xsd:decimal and "2.0"^^xsd:decimal denote the same value, yet they are distinct RDF terms and, again, will not be considered equivalent when a query matches them against the data.
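One way to picture what fusion would need here is to compare literals by the value they denote instead of by their spelling. The following Python sketch does that for the three datatypes mentioned above. The `PARSERS` table and the helper names are my own assumptions for illustration, and Python's `int()` and `Decimal()` are more permissive than the XML Schema lexical rules (no range check is done for xsd:byte, for instance), so treat it as a rough sketch only.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical table: datatype IRI -> Python parser for its lexical forms.
# Note: no range checking (e.g. xsd:byte) and no strict lexical validation.
PARSERS = {
    XSD + "integer": int,
    XSD + "byte": int,
    XSD + "decimal": Decimal,
}


def value_of(lexical, datatype):
    """Map a typed literal to a Python value; KeyError for unknown datatypes."""
    return PARSERS[datatype](lexical)


def same_value(a, b):
    """Compare two (lexical, datatype) pairs by value rather than by term."""
    return value_of(*a) == value_of(*b)


one = ("1", XSD + "integer")
padded_one = ("+0001", XSD + "integer")
byte_one = ("1", XSD + "byte")
two = ("2", XSD + "decimal")
two_point_zero = ("2.0", XSD + "decimal")

print(one == padded_one, same_value(one, padded_one))
print(two == two_point_zero, same_value(two, two_point_zero))
```

Compared as terms (plain tuple equality), each pair differs; compared through `same_value`, each pair matches.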

Literals with and without language tags

Language tags also introduce this problem and can further sabotage any effort to maximize the fusion of data literals.  According to RDF, a plain literal is not the same as a literal carrying a language tag, which makes sense unless you are trying to maximize fusion.  So, let’s consider the following:

  • "foo" and "foo"@en are not equal
  • "foo"@en, "foo"@En and "foo"@EN are equal (thank heavens)
  • "foo"@en and "foo"@eng are not equal (ISO-639-1 code versus ISO-639-2 code [4])
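If we wanted to fuse across the third case as well, one option would be to normalize language tags ourselves. The Python sketch below lowercases a tag and rewrites a three-letter ISO 639-2 primary subtag to its two-letter ISO 639-1 counterpart. The `ISO_639_2_TO_1` table is a deliberately tiny, hypothetical excerpt and `normalize_lang` is a name I made up; nothing in RDF or SPARQL performs this mapping.

```python
# Hypothetical, deliberately tiny excerpt of an ISO 639-2 -> ISO 639-1 mapping.
# A real system would need the full code list [4].
ISO_639_2_TO_1 = {
    "eng": "en",
    "fra": "fr",
    "fre": "fr",
    "deu": "de",
    "ger": "de",
    "spa": "es",
}


def normalize_lang(tag):
    """Lowercase a language tag and shorten a known three-letter primary subtag.

    Returns None unchanged, so untagged literals stay untagged.
    """
    if tag is None:
        return None
    primary, *rest = tag.lower().split("-")
    primary = ISO_639_2_TO_1.get(primary, primary)
    return "-".join([primary, *rest])


print(normalize_lang("EN"), normalize_lang("eng"), normalize_lang("en-US"))
```

Whether an untagged "foo" should also be fused with "foo"@en is a policy decision that this sketch intentionally leaves alone.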

UTF Normalization

RDF literals are Unicode strings, most commonly serialized as UTF-8, which allows for text in multiple languages, a necessary condition for a data representation that intends to be the data language of the World Wide Web.  But Unicode introduces some normalization issues of its own.  To quote Wikipedia [5]:

For example, the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

But SPARQL does not abide by this (a decision most likely made to spare triple store implementations a potentially huge processing burden), as we can see by running the following test query at sparql.org:

    ASK WHERE {
       ?s ?p ?o .
       FILTER( "\u006E\u0303" = "\u00F1" )
    }

The query returns the answer “no”.
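The same behavior is easy to reproduce outside of SPARQL. This short Python sketch, using only the standard library's `unicodedata` module, compares the two spellings of “ñ” first code point by code point and then after normalizing both to NFC. Choosing NFC over the other normalization forms is my assumption here, not something RDF or SPARQL prescribes.

```python
import unicodedata

decomposed = "\u006E\u0303"   # "n" followed by a combining tilde
precomposed = "\u00F1"        # the single code point for n with tilde

# Code-point comparison, analogous to what the query above does.
raw_equal = decomposed == precomposed

# Comparison after normalizing both strings to NFC.
nfc_equal = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", precomposed)
)

print(raw_equal, nfc_equal)  # prints: False True
```

The raw comparison fails just as the query does, while the normalized comparison succeeds.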

Further Thoughts

As we can see, even at the lowest levels of data representation we can run into issues of equality.  For a system looking to maximize query results, this becomes a problem that must be addressed.  This leaves one to wonder:

  • How could we address this problem without normalizing data on ingest?
  • If we did choose to normalize data on ingest, what would such a process look like?
  • Are there any systems today that attempt to address this problem?
  • Are there any similar problems that exist solely in RDF or SPARQL other than those mentioned here?

In a future blog post I hope to address some of these remaining questions.

[1] http://answers.semanticweb.com/questions/3832/plain-literals-vs-xsd-typed-literals
[2] http://answers.semanticweb.com/questions/9781/regarding-plain-literal-and-rdfplainliteral-equality
[3] http://richard.cyganiak.de/blog/2011/05/the-rdf-11-literal-quiz/
[4] http://www.loc.gov/standards/iso639-2/php/code_list.php
[5] http://en.wikipedia.org/wiki/Unicode_equivalence
[6] http://answers.semanticweb.com/questions/16864/matching-typeduntyped-literals-in-sparql-joins
