Facing the Challenges of Big Data

YarcData would like to introduce our first guest blogger, Dean Allemang. We’re delighted to have a noted expert on the Semantic Web join us for a series of blog articles discussing how graph analytics is quickly gaining prominence in the enterprise for solving complex business analytics problems.

Big Data is all the rage, especially since it received a new shot in the arm from a recent Gartner report.  Everyone is interested in it, and Gartner [1] has thrown its hat into the ring with a definition of it.  But even so, many people use these words in different ways. What is Big Data, and why all the interest in this topic now?

For many people, Big Data is synonymous with search optimization (for many people, all of data science is synonymous with search optimization).  For commercial websites, visitor behavior equals revenue.  Understanding that behavior is essential for growing a web-based business and staying competitive.  So we gather data about our visitors and their habits, and try to glean some understanding of what does and doesn’t work to turn visits into value.

But for others, Big Data comes from outside a single web site.  Social networking sites provide access to huge amounts of data about user behavior.  If we could access and utilize this data, we could understand users who haven’t even visited our web sites.

You might not know it, if you just read articles about SEO, but there actually is a world outside of web commerce, and it is a big world.  Actually, several big worlds: there is a world of Finance, of Science, of Manufacturing, and these worlds are also data intensive.  Detailed financial behavior has a huge impact on financial indicators, and hence provides a huge opportunity for data exploitation [2].

What do all these things have in common?  All of them are spaces that are dominated by Big Data.

While there are some specific technologies that are often associated with Big Data, Big Data itself isn’t a technology; it is a description of a set of challenges in today’s world that are faced in these and many other domains.  What are the recurring aspects of a Big Data challenge?

1) Large amounts of data.  All of the examples I mentioned above—web traffic, social networks, financial transactions, and scientific measurements—are producing data at a faster rate than has ever been imagined before.  This has resulted in an apparent explosion of available data.  This driving force has given Big Data its name.

2) Complexity of data, and of the questions we need to ask.  User behavior, market analysis, scientific conclusions: to gain insight into any of these things requires creativity and a deep understanding of the complex interplay of many factors.  Simple data analysis was sufficient in the olden days.  Now we need complex answers to complex questions to understand the world and be competitive.  In many cases, we need those answers fast – sometimes lightning fast. [2]

3) Heterogeneity of data.  In the case of gathering data from a single web site, the data might come from a single source.  But in most of the cases I mentioned before, data will come from many sources.  Social networks track behavior across multiple sites and applications.  Financial data includes market intelligence, as well as massive transaction data.  Contributions are being made to the world’s collection of scientific data from all over the world.

These challenges have caused us to move our thinking about how to manage data away from the tried-and-true methods that have held sway in enterprises for the past 30 years.  During that time, data management was done largely within a single (mostly homogeneous) organization, and the data structure was largely understood (going all the way back to the “Master Data Record” that many businesses had for a long time).

Ironically, Challenge #1 (large amounts of data) is the one from which the Big Data movement gets its name, but is also the only one that has been addressed systematically by data management technology for decades.  The scale of relational database systems has improved steadily over the years; the ability of these systems to handle complex and heterogeneous data has not been as much a focus of development.

Most new technologies that have been developed to address the Big Data challenges have begun from a vantage point of Challenge #1. From there, they have taken different approaches to addressing challenges #2 and #3.

The World Wide Web Consortium has come at this problem from another angle. Starting naturally enough (for the W3C) with distribution of data as their point of focus, they developed RDF, a framework for managing data resources distributed over the web, and SPARQL, a powerful query language for RDF.

RDF achieves its data distribution goals by representing data as a graph; SPARQL provides a powerful way to manage the complexity that results from combining distributed data sets. Many critics have doubted whether such an approach, driven primarily as it is by data distribution and complexity concerns, can be further developed to address the scale issues of Challenge #1.
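To make the graph idea concrete, here is a toy sketch in plain Python. It is not RDF or SPARQL itself, and it is not how any particular RDF database works; the data sets, the `ex:` identifiers, and the `match` and `query` helpers are all hypothetical names made up for illustration. It only shows why triples from different sources combine so easily, and how a query built from triple patterns can span both sources.

```python
# Toy illustration only: RDF-style triples modeled as plain Python tuples.
# All names and identifiers below are hypothetical, invented for this sketch.

# Two independently published data sets, each a set of
# (subject, predicate, object) triples.
sales_data = {
    ("ex:alice", "ex:purchased", "ex:widget"),
    ("ex:bob", "ex:purchased", "ex:gadget"),
}
social_data = {
    ("ex:alice", "ex:follows", "ex:bob"),
    ("ex:carol", "ex:follows", "ex:alice"),
}

# Both sets share the same triple shape and identifiers,
# so merging them into one graph is just a set union.
graph = sales_data | social_data


def match(graph, pattern):
    """Yield variable bindings for one triple pattern.

    Terms that start with '?' are variables; everything else must match exactly.
    """
    for triple in graph:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated within one pattern must bind consistently.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


def query(graph, patterns):
    """Join several triple patterns, loosely like a SPARQL basic graph pattern."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables that earlier patterns have already bound.
            bound = tuple(solution.get(term, term) for term in pattern)
            for bindings in match(graph, bound):
                next_solutions.append({**solution, **bindings})
        solutions = next_solutions
    return solutions


# "What did the people Carol follows purchase?" needs both source data sets.
results = query(graph, [
    ("ex:carol", "ex:follows", "?person"),
    ("?person", "ex:purchased", "?item"),
])
print(results)
```

Neither source data set can answer the question alone; the answer only appears once the two sets of triples are combined into one graph. A real SPARQL engine does far more (optional patterns, filters, aggregation, query optimization), so treat this as a picture of the idea rather than of the technology.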

Just because an RDF database focuses on the complexity and diversity issues of the Big Data challenge doesn’t mean it can’t deal with large data sets as well.  Many RDF databases and SPARQL engines today are able to scale to very large sizes.

YarcData’s Urika™ technology is a good example. Urika™ directly addresses all three challenges of Big Data. As an RDF database, it excels in diversity and distribution of data.  Its highly parallelized architecture lets it achieve fast response times even for complex queries over large data sets.

With the advent of high-performance RDF databases, these W3C technologies have become key players in Big Data technology.

[1] http://www.gartner.com/newsroom/id/2359715
[2] http://www.thedailybeast.com/newsweek/2013/01/04/eunuchs-of-the-universe-tom-wolfe-on-wall-street-today.html

Dean Allemang, co-author of the bestselling book, Semantic Web for the Working Ontologist, is a consultant, thought-leader, and entrepreneur focusing on industrial applications of distributed data technology. He served nearly a decade as Chief Scientist at TopQuadrant, the world’s leading provider of Semantic Web development tools, producing enterprise solutions for a variety of industries. As part of his drive to see the Semantic Web become an industrial success, he is particularly interested in innovations that move forward the state of the art in distributed data technology.  Dean’s current work is concentrated on the life sciences and finance industries, where he currently sees the most promising industrial interest in this technology.

Reflecting on our Graph Analytics Challenge

I wanted to thank the participants of our $100,000 Graph Analytics Challenge. It took our judges much deliberation to choose the winner, because each of the six finalists presented a complex problem and offered a unique, innovative solution. The entries spanned a number of diverse topics, from medical research to social collaboration to sports analytics. I could not be more pleased that graph analytics is drawing such high-calibre experts who are so passionate about their work to better society.

As graph analytics gains traction, customers and analysts have asked me how I believe graph analytics can improve existing technologies and address real business concerns. My opinion aside, our contest has demonstrated the applicability of graph analytics to high-impact use cases. From discovering a cure for autism or preventing crimes to predicting baseball outcomes – all are important issues with significant business and human impact.

The Graph Analytics Challenge has shown that the most complex problems involve discovering the previously unknown. Discovery is challenging because you don’t know in advance what queries you will run, or what data you will need. To quote Ilya Shmulevich from the Institute for Systems Biology: “In the amount of time it takes to explore one hypothesis, we can now explore thousands of hypotheses, massively improving our success rate.” I think that summarizes the YarcData value proposition in a nutshell.

I’d like to extend my thanks to all our contest participants and especially to our first, second, and third place winners respectively: Ilya Shmulevich, Brady Bernard, and Andrea Eakin, of The Institute for Systems Biology; Adam Lugowski, John Gilbert, and Kevin Dewesse, of the University of California at Santa Barbara; and Abraham Flaxman of the University of Washington Institute for Health Metrics and Evaluation. We look forward to seeing where all your research takes us into the future and how through systematized Discovery you are making the world a better place.

SPARQL 1.1 Becomes an Official Recommendation

For a couple of us “propeller heads” here at YarcData, we had a semi-earth-shattering moment when the SPARQL 1.1 specification became a recommendation. Yes, I’m aware of how that sounds; however, please bear with me a moment longer while I share a bit of my excitement.

I started working with ontologies back in 2001 and have come to love this form of knowledge representation and the semantic web technologies that accompany it. The idea of being able to describe the world in terms that are adequately descriptive and yet elegant is a bit of a geek love of mine. Ever since I discovered RDF and SPARQL, I have been working to do what I can, in my own little corner of the world (no jokes about that being a closet please), to progress the state of this technology that has such high potential to change the way we think and interact with computers.

These days, I am but one member of a team who qualify as implementers of this standard. While I am not a member of the Semantic Web Working Group, and thus not an authoritative source of information on their activity, I can tell you that work on SPARQL 1.1 has been in progress for at least four years (the first draft was published October 22, 2009; the recommendation on Mar 21, 2013).

Coming from an RDBMS background as an applications developer and DBA, I tend to enjoy working with SPARQL, trying to come up with elegant ways to query data, and exploring the nuances of this language and its companion, RDF.

So, now bear with me a moment longer while I share with you a “where were you when [X] happened” moment… Yesterday, a colleague, Rob Vesse, and I were filming a new video for our website. I was in the back of the room while he was shooting a segment in which he mentioned in passing that SPARQL was still in “proposed recommendation” status. I thought to myself, “wouldn’t it be kind of funny if they released the final version today,” this having a fairly large impact on our jobs as implementers of this standard.

One of the more challenging tasks of any developer is trying to design and develop to a moving target, so we have been kind of anxiously awaiting the day it would become a “recommendation” of the W3C, the final state of any specification traveling through the W3C standards development process. So, immediately following the wrap of his shoot, I grabbed my laptop to check, and sure enough it was perhaps only hours after the semantic web working group decided to publish the SPARQL 1.1 spec as a final recommendation. This produced a sufficiently funny “You’ve got to be kidding me” kind of a response from Rob when I shared the news with him, since he didn’t want to go back and redo his whole spiel.

As a data geek, I dream of a day when people and computers are able to talk to each other in a language absent of ambiguity, holes, dead ends and “No results found.” And pardon me whilst I get my geek on, in what I think may someday be a small milestone towards that vision of the future…but thanks for hanging in with me while I did.

With that I bid you “query on…”