What are the Causes of Your Big Data Problem?

Adnan

In my last blog, I shared some interesting articles and hopefully got you thinking, “Why can’t I solve my big data problems with all this progress in technology?” But let’s not get ahead of ourselves. Let’s start with the causes of the problem.

Why all the problems?

Technologies fail for reasons as wide-ranging as those that challenge almost any human endeavor. Sometimes the technology is complicated and misunderstood, resulting in its incorrect application. Other times the reasons are far more mundane and bureaucratic. Let’s look at some of the common patterns of failure so you don’t have to repeat them.

Failure to understand the data

This is by far the most common reason for reaching imprecise conclusions. It’s not enough to have massive volumes of data; you also need a general understanding of it. It’s a bit of a chicken-and-egg problem: wasn’t big data supposed to let the data do the talking, and not the other way around? Theory-free conclusions can be dangerous because you don’t know or understand the causality in your data or the assumptions you’re implicitly making. In other words, you don’t know what will cause your model to break down. You’ve probably heard it many times, but correlation does not imply causation. Correlation is just one of the many tools at your disposal for understanding your data, but if your conclusions depend on an implied order of events, you might be in for a surprise.
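As a quick illustration (my own toy example, not one from the articles mentioned earlier), the short Python sketch below builds two series that merely trend upward over the same time period. They correlate almost perfectly, yet neither one causes the other; the shared trend does all the work.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two series with no causal link: both simply trend upward over the
# same 60 time steps (think ice cream sales and software downloads).
t = np.arange(60)
series_a = 2.0 * t + rng.normal(0, 5, size=t.size)
series_b = 3.5 * t + rng.normal(0, 8, size=t.size)

# The correlation is very high even though neither series drives the
# other; the hidden common driver is simply the passage of time.
corr = np.corrcoef(series_a, series_b)[0, 1]
print(f"Correlation: {corr:.3f}")   # typically well above 0.95
```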

Is the unsampled dataset representative of your population?

This one goes hand-in-hand with failing to understand your data, but it’s worth giving some thought to the sources of your data to ensure that the data is being generated uniformly across the entire population of interest. There are excellent texts that deal with the design and analysis of experiments; you don’t have to be an expert, but a basic understanding of common mistakes can go a long way. For example, if you’re trying to use Twitter data to predict election results, the question you should be asking is, “What percentage of likely voters use Twitter?” This has also been a problem with using Twitter data for disaster recovery planning, because Twitter users are disproportionately young, urban, affluent smartphone users who are likely to tweet near where they live. Any projections based on those tweets would be valid for such neighborhoods, but not for the entire population.
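To make the Twitter example concrete, here is a small simulation (all numbers are invented purely for illustration) of a population in which young, urban users both favor one candidate more strongly and are far more likely to show up in the sample. The biased sample noticeably overstates that candidate’s support.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 1,000,000 voters: 30% are "young/urban",
# and that group favors Candidate A more strongly than everyone else.
n = 1_000_000
young_urban = rng.random(n) < 0.30
supports_a = np.where(young_urban,
                      rng.random(n) < 0.65,   # 65% support among young/urban
                      rng.random(n) < 0.45)   # 45% support among the rest

print(f"True population support: {supports_a.mean():.3f}")   # ~0.51

# A "Twitter-style" sample: young/urban voters are ten times more
# likely to be included, so the sample over-represents their views.
include_prob = np.where(young_urban, 0.10, 0.01)
sampled = rng.random(n) < include_prob
print(f"Biased-sample estimate:  {supports_a[sampled].mean():.3f}")   # ~0.61
```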

More data does not equal more information

This one should be obvious, but when you’re neck-deep in something, it’s easy to forget. Data is just data; only through analytics does it become information, and that’s when it becomes useful. Readers with a stats background will be familiar with the coefficient of determination, commonly called R-squared: the proportion of the variance in your outcome that the model explains. The catch is that R-squared never decreases when you add predictors. Add more variables to your data and the R-squared of your analysis will go up, even if those variables do nothing to improve the regression model (which is exactly why the adjusted R-squared exists).
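Here is a minimal sketch of that effect, using numpy and scikit-learn (my choice of tools, not anything prescribed here): the target depends on a single real predictor, yet the training R-squared keeps inching upward as columns of pure noise are appended.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# One genuinely useful predictor plus noise in the target.
n = 200
x_real = rng.normal(size=(n, 1))
y = 3.0 * x_real[:, 0] + rng.normal(0, 1.0, size=n)

# Pre-generate 30 columns of pure noise and use nested prefixes of it,
# so each design matrix strictly contains the previous one.
noise = rng.normal(size=(n, 30))
for k in (0, 10, 20, 30):
    X = np.hstack([x_real, noise[:, :k]])
    r2 = LinearRegression().fit(X, y).score(X, y)
    print(f"{1 + k:2d} predictors -> R^2 = {r2:.4f}")  # never decreases
```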

Human bias

This is a fun topic, and quite an extensive one. Cognitive biases come in many flavors (confirmation bias, outcome bias, survivorship bias, and so on), and all of them can affect your decision making. Let me tell you about a seminar on “Outcome Bias” that I didn’t attend because I already knew what it was all about. How many times has that happened to you? Upper management is most often guilty of this (one might argue that parents are just as guilty). There’s a reason humans suffer from bias: we generalize from past experience. But we can be obstinate and refuse to change even when the data tells us we should. Make sure you check your data first. And then check it again. As the failure of big data in education showed, sometimes the problem lies not with the data but with the suppliers and consumers of that data, who must be convinced of its benefits.

Not everything is a big data problem

You’ve heard the one about the boy with a new hammer to whom everything looked like a nail? To many folks who have just been exposed to the exciting prospects of big data, it is that proverbial hammer looking for nails. You could refer back to the R-squared example above, but let’s pick a more exciting topic: wine production. You would think that, given the variables involved (grape types, soil, weather, water, a winemaker’s skill, and so on), wine production would be an ideal candidate for a big data solution. As it turns out, it isn’t, because you need just three variables to predict the quality of a vintage. You could argue that other aspects of wine production might benefit from the “right” sort of big data, and you would probably be right.
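To underline how modest the problem really is, here is a toy regression in Python with made-up variable names and synthetic numbers (it is not the actual three-variable vintage model, just a sketch of its shape): an ordinary least-squares fit on a few dozen rows, no cluster required.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Hypothetical predictors (names are illustrative only): winter
# rainfall, growing-season temperature, and harvest rainfall.
n = 40  # a few decades of vintages -- tiny by big data standards
winter_rain = rng.normal(600, 100, n)
season_temp = rng.normal(17, 1.5, n)
harvest_rain = rng.normal(80, 30, n)

# Synthetic "vintage quality" score driven by those three variables.
quality = (0.002 * winter_rain + 0.6 * season_temp
           - 0.004 * harvest_rain + rng.normal(0, 0.3, n))

X = np.column_stack([winter_rain, season_temp, harvest_rain])
model = LinearRegression().fit(X, quality)
print("R^2 with just three variables:", round(model.score(X, quality), 3))
```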

Do you understand the limits of your hypothesis testing?

This forms the basis of how you test new theories. The problem here is not the data but the accuracy of the test itself, and it’s more of an issue in empirical tests that rely on physical techniques. However, poorly thought-out corner cases can affect any kind of test. False positives and false negatives can undermine your results, and they generally point to the need for more stringent testing methodologies. The Economist has an excellent video explaining this, though the situation is not quite as dire as they make it sound.
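If you want to see the false-positive problem for yourself, the simulation below (a generic illustration of my own, not The Economist’s example) runs 1,000 tests in which no real effect exists; about 5% of them still come back “significant” at the usual threshold.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate 1,000 hypothesis tests where the null is actually true:
# both groups are drawn from the same distribution, then compared
# with a simple two-sample z-style statistic.
n_tests, n_per_group = 1_000, 50
false_positives = 0
for _ in range(n_tests):
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / n_per_group
                                        + b.var(ddof=1) / n_per_group)
    if abs(z) > 1.96:            # the usual 5% two-sided threshold
        false_positives += 1

# Roughly 5% of tests "find" an effect that is not there.
print(f"False positives: {false_positives} / {n_tests}")
```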

Coming in part III

In the third and final part of this series, I’ll discuss what your big data strategy should be.

The Definition of Insanity

Ted

Almost ten years ago, when I was working at Pfizer, I wrote a position paper for a W3C Workshop on Semantic Web in Life Sciences. In that paper, I pointed out several vexing problems then faced by pharmaceutical researchers that I thought could be alleviated by use of a powerful knowledge architecture such as that enabled by Semantic Web technologies. Among these problems were those you might classify as knowledge management problems, and they had much to do with effectively sharing information throughout a large research organization where specialized vocabularies and varied purposes can easily get in the way.

Other problems I described were just good, old-fashioned informatics problems, in particular the creation of so-called “data silos.” Data silos are created when data are put into databases or documents in their own unique format, a bespoke schema for example, such that they are not interoperable with any other data. Data silos are destructive to a knowledge-based organization because they prevent the synthesis of disparate information necessary to gain useful insight into data. Data silos kill productivity and innovation, because they limit the kinds of questions researchers can ask of their data and are typically difficult to modify or expand (which means you’re probably not going to get very much help any time soon). Data silos happen because by now we don’t even think about what container we’re going to put our data in; it just reflexively goes into a relational database or a document. And so we almost never think ahead to consider how we’ll represent the knowledge so that we can actually put it to use along with other data.

Semantic Web technologies, including RDF(S) graphs, Linked Data, and SPARQL, provide a standard, uniform way to model data and capture their semantics, so they can really help with the problems described in that position paper.

But only if they get used.
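For readers who haven’t seen these technologies in action, here is a minimal sketch of what “using them” can look like, written with the rdflib Python library (my choice for illustration; the compound and assay names are invented): a handful of RDF triples describing data that would normally sit in two separate silos, queried together with a single SPARQL query.

```python
from rdflib import Graph

# A few triples in Turtle: a compound registry and an assay database
# (normally two silos) described with shared URIs, so one query can
# span both.
turtle = """
@prefix ex: <http://example.org/> .

ex:compound42 ex:name     "Examplib" ;
              ex:testedIn ex:assay7 .
ex:assay7     ex:target   ex:proteinX ;
              ex:ic50_nM  12.5 .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# Follow the link from compound to assay in one SPARQL query.
query = """
PREFIX ex: <http://example.org/>
SELECT ?name ?ic50 WHERE {
    ?compound ex:name ?name ;
              ex:testedIn ?assay .
    ?assay    ex:ic50_nM ?ic50 .
}
"""
for name, ic50 in g.query(query):
    print(name, ic50)
```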

The interesting thing about that position paper (and here I hasten to remind you that it was written a decade ago) is that I’m pretty sure I could submit it as-is today, and it would be just as correct as it was when I originally wrote it. Every single one of the problems described in that paper is still being experienced today by life sciences researchers in every substantial research organization. Except now there’s a lot more data, so those problems are even worse.

At that Workshop I showed a sort of tongue-in-cheek slide illustrating the current state of knowledge management in large life sciences research organizations. It looked like this:

Yeah, I’d still show that, too.

So, nothing has really changed in ten years. How can that be? These problems are serious, and the goals of life sciences research are simply too important to let anything stand in the way.

Did you ever hear that Albert Einstein said, “The definition of insanity is doing the same thing over and over again and expecting a different result”? Well, it turns out he probably never said that. (I know, I thought he did, too!) Nonetheless, I think he gets the credit because it’s such a smart way to look at it. Now, here’s the kicker: over the last ten years, the way we’ve been using computers to help us solve research problems basically hasn’t changed a bit. To put it another way, we keep putting our data into relational databases or documents like we’ve always done, and hoping that this time it’s going to be different.

Well, I hate to be the one to tell you this, but it isn’t. And you just built yourself yet another data silo, didn’t you?

I’ll tell you what Albert Einstein did say: “We can’t solve problems by using the same kind of thinking we used when we created them.”

Part of what makes the life sciences so endlessly fascinating, so fun, is the sheer amount and variety of information, and how all those bits of information are related to each other. We owe it to ourselves to learn how to work with those data, so that we can understand what it all means and make a difference in the lives of the people who need our help. New things are scary, but think of what we could accomplish if we’re brave enough to put our energy into the data, not just into the containers for the data. Einstein would be proud.

Gooooooooooooaaaaaaaal! Data Analytics Could Improve Soccer Results

Misti

I’m a diehard soccer fan and have been glued to the screen during the World Cup games.

While watching the last game (viva Brazil!), I was thinking about how one of YarcData’s customers uses our technology to improve pitcher/batter lineups in baseball. I realized that data analytics could also be used for soccer (known in much of the world as fútbol), with some potentially interesting results.

Players are equipped with various devices to monitor heart rate and other factors so the amount of time spent strengthening, training and resting can be optimized. With the explosion of data and realization of the value behind it, every movement is now being recorded, from which foot the players are using to pass the ball to the number of steps they’re taking in a match.

With such a level of granularity in the data, soccer strategies including lineups and training schedules can be taken to the next level.

One of the most exciting matches of the World Cup was Brazil vs. Chile, which ended in a 1-1 deadlock and was unfortunately decided by penalty kicks. Chile, a slightly less skilled team, held its own but couldn’t pull through because Brazil’s Julio Cesar made some phenomenal saves. With Chile knocking on the door and looking to close out the game, statistics about where players tend to place penalty kicks could have helped Chile’s goalie and turned the outcome in their favor. How much money may have been lost in endorsements, prize money and other rewards for Chile?

If soccer teams could combine all their player data in a graph to discover new insights, athlete injuries could be mitigated or even prevented. Team owners’ multi-million-pound investments would be protected. If trainers had more real-time ways to look at which players are getting winded or are close to exhaustion, they might play them differently. And the ability to analyze patterns in team behaviors — like a propensity to attack or be defensive, or how players perform at different elevations or in various weather conditions — would help with game strategy. Even video analytics could be incorporated to predict and optimize lineups, putting the team in a position to win.
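As a sketch of what “combining player data in a graph” might look like (entirely illustrative: the players, matches, attributes, and the networkx library are my own choices, not anything a real team uses), a few nodes and edges are enough to ask a simple fatigue question.

```python
import networkx as nx

# A tiny, made-up graph of players and a match, with attributes on
# nodes (player stats, match conditions) and edges (workload).
G = nx.Graph()
G.add_node("player_1", position="forward", avg_heart_rate=172)
G.add_node("player_2", position="midfield", avg_heart_rate=181)
G.add_node("match_A", weather="hot", elevation_m=2200)
G.add_edge("player_1", "match_A", minutes=90, sprints=48)
G.add_edge("player_2", "match_A", minutes=75, sprints=61)

# One simple "insight" query: who looked close to exhaustion in the
# hot, high-altitude match? (high heart rate plus heavy sprint load)
for player, attrs in G.nodes(data=True):
    if "avg_heart_rate" not in attrs:
        continue  # skip the match node
    sprints = G.edges[player, "match_A"]["sprints"]
    if attrs["avg_heart_rate"] > 175 and sprints > 50:
        print(f"{player}: consider an earlier substitution")
```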

I’d be happiest if all this was possible for my favorite team – particularly if it would help them win! First choice: USA. Second choice: Brazil.