In my last blog, I shared some interesting articles and hopefully got you thinking, “Why can’t I solve my big data problems with all this progress in technology?” But let’s not get ahead of ourselves. Let’s start with the cause of the problem.
Why all the problems?
Technologies fail for reasons that are as wide-ranging as those that challenge almost any human endeavor. Sometimes the technology is complicated and misunderstood, resulting in its incorrect application. Other times the reasons are far more mundane and bureaucratic. Let’s look at some of the common patterns of failures so you don’t have to repeat them.
Failure to understand the data
This is by far the most common reason for reaching imprecise conclusions. It's not enough to have massive volumes of data; you must also have a general understanding of it. It's a bit of a chicken-and-egg problem: wasn't big data supposed to let the data do the talking, and not the other way around? Theory-free conclusions can be dangerous because you don't know the causal structure of your data or the assumptions you're implicitly making. In other words, you don't know what will cause your model to break down. You've probably heard this many times, but correlation does not imply causation. Correlation is just one of many tools at your disposal to help you understand your data, but if your conclusions depend on an implied order of events, you might be in for a surprise.
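To see how easily correlation can masquerade as causation, here is a minimal sketch (the series names and numbers are invented for illustration): two series that share nothing but a common upward trend still correlate strongly, even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical series with no causal link, just a shared trend
# (e.g., both happen to grow over the same 100 observation periods).
t = np.arange(100)
series_a = 10 + 0.5 * t + rng.normal(0, 2, 100)
series_b = 3 + 0.2 * t + rng.normal(0, 1, 100)

r = np.corrcoef(series_a, series_b)[0, 1]
print(f"correlation: {r:.2f}")  # close to 1, yet neither series drives the other
```

Detrend either series and the apparent relationship largely disappears, which is exactly the kind of implicit assumption that can break a model.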
Is the unsampled dataset representative of your population?
This one goes hand in hand with failing to understand your data. Give some thought to the sources of your data to ensure that it is being generated uniformly across the entire population of interest. There are some excellent texts on the design and analysis of experiments; you don't have to be an expert, but a basic understanding of common mistakes can go a long way. For example, if you're trying to use Twitter data to predict election results, the question you should be asking is "What percentage of likely voters use Twitter?" This has also been a problem when using Twitter data for disaster recovery planning, because Twitter users are disproportionately young, urban, affluent smartphone owners who tend to tweet near where they live. Any projections based on those tweets would be valid for such neighborhoods, but not for the entire population.
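A quick simulation makes the bias concrete (every rate below is an assumption invented for illustration, not a real survey figure): if one subgroup both holds a different opinion and tweets more often, a Twitter-only sample drifts away from the population.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 30% "young urban" users, 70% everyone else.
# Assume young urban users support some measure at 70%, others at 40%.
n = 100_000
young_urban = rng.random(n) < 0.30
support = np.where(young_urban,
                   rng.random(n) < 0.70,
                   rng.random(n) < 0.40)

true_rate = support.mean()  # what a uniform sample would estimate, ~0.49

# Now assume young urban users tweet 5x as often, so a Twitter
# sample over-represents them.
tweet_prob = np.where(young_urban, 0.50, 0.10)
tweeted = rng.random(n) < tweet_prob
twitter_rate = support[tweeted].mean()  # ~0.60, biased upward

print(f"population support:     {true_rate:.2f}")
print(f"Twitter-sample support: {twitter_rate:.2f}")
```

The Twitter sample is perfectly accurate about Twitter users; it is simply answering a question about the wrong population.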
More data does not equal more information
This one should be obvious, but when you're neck-deep in something, it's easy to forget. Data is just data; only through analytics does it become information, and that's when it becomes useful. Readers with a stats background will be familiar with the coefficient of determination, or R-squared as it's commonly called. It's a statistic that describes how much of the variation in your data your model explains. However, R-squared is monotonically non-decreasing: add more variables to your regression and it will rise (or at worst stay flat) even if those variables do nothing to improve the model.
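You can check R-squared's monotone behavior directly. This sketch fits an ordinary least-squares model, then keeps appending columns of pure noise; the in-sample R-squared never goes down, even though no information was added.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(X, y):
    """In-sample R^2 of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

n = 200
x = rng.normal(size=(n, 1))
y = 2 * x[:, 0] + rng.normal(size=n)  # y truly depends on one variable

r2 = [r_squared(x, y)]
X = x
for _ in range(10):
    X = np.column_stack([X, rng.normal(size=n)])  # add a pure-noise column
    r2.append(r_squared(X, y))

print([round(v, 4) for v in r2])  # never decreases, despite the noise
```

This is why adjusted R-squared (or out-of-sample validation) exists: raw R-squared rewards you for adding data whether or not it carries information.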
Beware of bias
This is a fun topic, and quite an extensive one. All manner of cognitive biases can affect your decision making. Let me tell you about a seminar on "Outcome Bias" that I didn't attend because I knew what it was all about. How many times has that happened to you? Upper management is most often guilty of this (one might argue that parents are just as guilty). There's a reason humans suffer from bias: we generalize from past experience. But we can be obstinate and refuse to change even when the data tells us we should. Make sure you check your data first. And then check it again. As the failure of big data in education showed, sometimes the problem lies not with the data but with the suppliers and consumers of that data, who must be convinced of its benefits.
Not everything is a big data problem
You’ve heard the one about the boy with a new hammer who found that everything looked like a nail? To many folks who have just been exposed to the exciting prospects of big data, it is that proverbial hammer just looking for nails. You can refer back to the R-squared example above, but let’s pick a more exciting topic: wine production. You would think that given the variables in grape types, soil, weather, water, a winemaker’s skill, etc., wine production would be an ideal candidate for a big data solution. As it turns out, wine is not an ideal candidate because you need just three variables to predict the quality of the vintage. You could argue that there might be other aspects of wine production that could benefit from the “right” sort of big data, and you probably would be right.
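As a purely hypothetical sketch of the idea (the variable names, units, and coefficients below are my own assumptions for illustration, not taken from any real wine model), a plain three-variable regression can already explain most of a synthetic "quality" signal, with no big data machinery required:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical vintage data: 40 years of three weather inputs.
n = 40
winter_rain = rng.normal(600, 100, n)   # mm (assumed)
harvest_rain = rng.normal(80, 30, n)    # mm (assumed)
summer_temp = rng.normal(19, 1.5, n)    # deg C (assumed)

# Synthetic "quality" score generated from those three inputs plus noise.
quality = (0.002 * winter_rain - 0.01 * harvest_rain
           + 0.8 * summer_temp + rng.normal(0, 0.5, n))

# Ordinary least squares with an intercept.
X = np.column_stack([np.ones(n), winter_rain, harvest_rain, summer_temp])
beta, *_ = np.linalg.lstsq(X, quality, rcond=None)
pred = X @ beta
r2 = 1 - ((quality - pred) ** 2).sum() / ((quality - quality.mean()) ** 2).sum()
print(f"R^2 with just three variables: {r2:.2f}")  # already high
```

When a handful of variables captures most of the signal, a small regression is the right tool; reaching for a big data stack adds cost without adding insight.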
Do you understand the limits of your hypothesis testing?
Hypothesis testing forms the basis of how you evaluate new theories. The problem here is not the data but the accuracy of the test itself, and it's more of a problem in empirical tests that rely on physical techniques. That said, improperly thought-out corner cases can affect any kind of test. False positives and false negatives can undermine your results and generally point to the need for more stringent testing methodologies. The Economist has an excellent video explaining this, though the situation is not as dire as it suggests.
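A short back-of-the-envelope calculation shows why false positives deserve so much attention (the rates below are illustrative assumptions, similar in spirit to the ones The Economist uses): when true effects are rare, most positive results can be false even with a seemingly strict test.

```python
# Assumed rates, for illustration only:
prevalence = 0.01        # 1% of hypotheses tested are actually true
power = 0.80             # P(test flags it | effect is real)
false_positive = 0.05    # P(test flags it | no effect), the usual 5% threshold

# Bayes' rule: of everything the test flags, how much is real?
p_flagged = prevalence * power + (1 - prevalence) * false_positive
p_real_given_flagged = prevalence * power / p_flagged
print(f"P(real effect | positive result) = {p_real_given_flagged:.2f}")  # ~0.14
```

Under these assumptions only about one flagged result in seven reflects a real effect, which is why stringent methodology and replication matter more than any single p-value.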
Coming in part III
In the third and final part of this series, I’ll discuss what your big data strategy should be.