Shotgun Analysis of Customers Via Big Data As Likely To Mislead As Help
Q: We hear a lot about “big data” as a tool to help individual companies understand their customers. What do you think?
Robert: Companies can certainly access more information about their customers than ever before, so you'd expect that to lead to cost savings, better sales, or other positive outcomes, and for some companies, like Amazon and Google, it does.
However, for most companies big data confuses more than it enlightens, because they collect huge amounts of data first and then mine it for insights. That's absolutely backwards, and it leads to wrong conclusions.
Q: I don’t understand. Why is that backwards?
Robert: First, it's not how science works. Collecting data without knowing what you're looking for is a shotgun approach. It's a little like netting thousands of fish without knowing which fish you actually care about.
Second, the problem has to do with misleading results.
Q: How does that happen?
Robert: Most big data and data mining work involves collecting data, then looking for patterns, or more specifically, correlations among the variables or data points. The idea is that by calculating those correlations you'll understand your customers better, market to them better, that kind of thing.
It's a bit more complicated than that, but the important thing to remember is that with big data you calculate thousands, or even tens of thousands, of correlations. Many will be insignificant, statistically or practically. Since computers do the work, the overhead is small.
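To make the scale concrete, here is a minimal Python sketch (the customer metrics and their values are made up purely for illustration) showing how a single pandas call produces over a thousand pairwise correlations from a fairly modest table:

```python
import numpy as np
import pandas as pd

# Hypothetical customer dataset: 10,000 customers, 50 behavioral metrics.
# Column names and values are illustrative, not from any real data set.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(10_000, 50)),
    columns=[f"metric_{i}" for i in range(50)],
)

# Every pairwise correlation among 50 variables: 50 * 49 / 2 = 1,225 coefficients.
corr_matrix = data.corr()
n_pairs = 50 * 49 // 2
print(f"{n_pairs} distinct correlations computed in one call")
```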
Q: So where’s the problem?
Robert: When you calculate a correlation coefficient, you need to determine whether it's occurring by chance (a fluke) or because there's a real relationship between the two things. So you test its statistical significance against some acceptable error rate. The error rate is the percentage of purely chance correlations that will nevertheless look "real." Typically you'd accept a 5% error rate, which corresponds to a 95% confidence level.
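A single test of that kind might look like the sketch below, with simulated data standing in for real customer measures; scipy's pearsonr supplies the coefficient and a p-value, which is compared against the 5% error rate:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=500)                # e.g., visits per month (illustrative)
y = 0.2 * x + rng.normal(size=500)      # e.g., spend per month (illustrative)

r, p_value = pearsonr(x, y)
alpha = 0.05  # acceptable error rate: 5% chance of calling a fluke "real"
if p_value < alpha:
    print(f"r = {r:.3f} is statistically significant at the 95% confidence level")
else:
    print(f"r = {r:.3f} could easily be a fluke (p = {p_value:.3f})")
```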
Q: OK.
Robert: So when you calculate a LOT of correlations, as you do with big data, at the 95% confidence level roughly five percent of the purely chance correlations will still come out looking significant. So for every 1,000 correlations (which isn't much when it comes to massive data), you'd expect about fifty false findings: relationships that appear significant but are in fact happening just by chance.
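You can see this happen with a quick simulation: generate data in which nothing is actually related to anything else, test every pair, and roughly five percent will still come up "significant." The sketch below is illustrative only, with arbitrary sample sizes:

```python
import numpy as np
from scipy.stats import pearsonr

# Simulate pure noise: no variable is actually related to any other.
rng = np.random.default_rng(2)
n_customers, n_metrics = 1_000, 46   # 46 metrics -> 1,035 pairs, roughly 1,000 correlations
noise = rng.normal(size=(n_customers, n_metrics))

false_positives = 0
total = 0
for i in range(n_metrics):
    for j in range(i + 1, n_metrics):
        _, p = pearsonr(noise[:, i], noise[:, j])
        total += 1
        if p < 0.05:
            false_positives += 1

# Expect roughly 5% "significant" correlations even though none are real.
print(f"{false_positives} of {total} correlations look significant by chance")
```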
Q: So you're saying there will be about 50 errors, and also that we won't know WHICH 50 are wrong?
Robert: Bingo. So if companies use a shotgun approach to big data, which is what they do, they WILL make some decisions based on faulty analysis. The more they trust those conclusions, the bigger the mistakes.
Q: That’s interesting. So is there a solution?
Robert: In a sense. The solution is to resist the temptation to apply massive statistical analysis to big data and do what real science does: start with a hypothesis, or a set of them, that specifies the relationships among a limited number of variables, and do THAT analysis. In other words, collect data for specific purposes, not just for pattern hunting, as sketched below.
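As an illustration of that hypothesis-first approach (the column names are hypothetical, chosen only for the example), you decide up front which few relationships matter and test only those:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypotheses specified BEFORE mining the data (column names are hypothetical).
hypotheses = [
    ("email_opens", "repeat_purchases"),   # H1: engaged readers buy again
    ("support_tickets", "churn_score"),    # H2: friction predicts churn
]

def test_hypotheses(df, pairs, alpha=0.05):
    """Test only the pre-specified relationships, not every possible pair."""
    results = {}
    for a, b in pairs:
        r, p = pearsonr(df[a], df[b])
        results[(a, b)] = (round(r, 3), round(p, 3), p < alpha)
    return results

# Toy data just to make the sketch runnable.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=["email_opens", "repeat_purchases",
                           "support_tickets", "churn_score"])
print(test_hypotheses(df, hypotheses))
```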
Another way of countering the problem is to replicate findings over time. Multiple "experiments" will not keep producing the same false relationships; the flukes move around. Relationships that are stable over time and across samples are likely to be valid, and important, too.
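A rough sketch of that replication check, again with simulated data standing in for two real time periods, is to keep only the relationships that test as significant in both samples:

```python
import numpy as np
from scipy.stats import pearsonr

# Replication check: a fluke correlation rarely shows up again in a fresh sample.
# The data here is illustrative noise; in practice these would be two time periods.
rng = np.random.default_rng(4)
sample_q1 = rng.normal(size=(1_000, 20))
sample_q2 = rng.normal(size=(1_000, 20))

def significant_pairs(sample, alpha=0.05):
    """Return the set of variable pairs that test as significant in one sample."""
    hits = set()
    n = sample.shape[1]
    for i in range(n):
        for j in range(i + 1, n):
            if pearsonr(sample[:, i], sample[:, j])[1] < alpha:
                hits.add((i, j))
    return hits

stable = significant_pairs(sample_q1) & significant_pairs(sample_q2)
print(f"Relationships significant in BOTH periods: {len(stable)}")
```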
Q: Thanks for explaining this, Robert.