Computerworld

INSIGHT: Big Data tells us what we want to hear

In the era of Big Data, the theory goes that with more data we can learn more. What if that is wrong?

In the era of Big Data, the theory goes that with more data, more inputs and more information, we can learn more, discover more and develop new insights.

What if that is wrong?

A “theory of everything” is a complex physical theory that all things in the universe are somehow related to each other and interact with each other all of the time.

General relativity and quantum field theory are frameworks that attempt to explain a theory of everything (ToE). More recently, string theory proposes that there may actually be a single unifying theory based upon vibrational “strings” for preferred resonance and useful dissonance.

In physics, that’s a grand theory. And, for some unknown reason, human beings have begun to interpret our theories (which are merely rationalised explanations we use to understand something) as actual facts.

To quote everyone’s favourite science officer, we just “have a theory that happens to fit the facts.” It is actually a human interpretation of the real world that specifies physical data points—physics exists without us.

The good news is that physics is mechanical. Unfortunately, it is far easier to evaluate the relative physical relationships of tangible objects than to do the same for logical concepts. And that, my friend, is the problem with data.

Data documenting business processes is merely a rationalised explanation of a logical theory that is proposed by humans from a distinctive perspective.

In simple terms, all data is biased toward the creator. All data is captured from multiple perspectives and represents multiple points of bias.

That means each new data point reflects the intent of the business process designer. This means it is not possible to actually assemble new analytics from existing data.

Even more concerning, it is not possible to alter analysis with new data—you can only reinforce or refute the expected analysis. Any new analysis actually also follows the data design at its creation.

Since most new data is inserted into an existing model, it inherits the previous bias and is limited by those previous logical decisions.

Human logic is inherently embedded in all business processes and all data capture is biased toward the expected outcome of those processes.

And since all data is pre-existing immediately after it is designed, it will always be tainted by the original design, which assumes both the business process and the meaningful questions about that process.

When a business analyst performs their task, they create inferences between data points where no actual data exists. When that same analyst becomes dissatisfied with their own theories of inference, they should seek new data to fill in the gaps.

Instead, most analysts resort to layering these inferences. All visibility into the underlying assumptions is lost and a data derivative is created. Adding more data does not necessarily support the analysis of new theories and often merely supports existing theory.

Data derivatives are reinforced over time, because they represent highly complex layers of logical arguments. But, once again, logic is an interpretation of facts.

So, what are we to do?

Over the past thirty-six months, Gartner has witnessed many data miners, data analysts, senior systems analysts and business intelligence professionals change their title to “data scientist”.

It is not unusual for professionals to adhere to market hype and promote their prestige and incomes by doing so. But, what business strategists need is “real” data science.

Real data science is the practice of building out competing interpretations of data, many multi-layered analytic theorems that intentionally challenge the inferences used by the others. True data science compares these theories along at least two axes.

First, how easy is it to trace the actual data used back to its originating business process? How many jumps or hops created the data? The number of assumptions and the complexity of the inferences between data points that are in use gives some idea of how reliable the data is.

Second, how far removed from the physical process world is the data point? Meters that record electrical pulses are pretty accurate; however, the manner in which they are recorded, the record layout, even the decision to record “meaningful change” in the electrostatic state of a device, is a form of bias.

We record what we want to hear. Data science quantifies these distances and constructs models that test these assumptions, maybe fifty, maybe a thousand different ways. Then data science tests the veracity of the models.
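To make the first axis a little more concrete, here is a minimal sketch of how those distances might be counted. Everything in it is an assumption for illustration only: the DataPoint structure, the example names such as meter_pulse and demand_forecast, and the idea of tagging each derivation step with a hand-assigned count of assumptions. It is not a description of any particular tool or method.

```python
from dataclasses import dataclass, field


@dataclass
class DataPoint:
    """Hypothetical lineage record for one data point."""
    name: str
    parents: list = field(default_factory=list)  # data points this one was derived from
    assumptions: int = 0  # inferences introduced at this step (assigned by hand)


def hops_to_source(point):
    """Longest chain of derivations between this point and directly captured data."""
    if not point.parents:
        return 0
    return 1 + max(hops_to_source(p) for p in point.parents)


def total_assumptions(point):
    """Sum the assumptions of this point and every distinct ancestor."""
    seen = {}
    stack = [point]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue  # count shared ancestors only once
        seen[id(node)] = node.assumptions
        stack.extend(node.parents)
    return sum(seen.values())


# Illustrative lineage: a forecast layered on a bill layered on a meter reading.
meter = DataPoint("meter_pulse")
reading = DataPoint("hourly_reading", [meter], assumptions=1)
tariff = DataPoint("tariff_table")
bill = DataPoint("estimated_bill", [reading, tariff], assumptions=2)
forecast = DataPoint("demand_forecast", [bill, reading], assumptions=3)

print(forecast.name, hops_to_source(forecast), total_assumptions(forecast))
```

A data point with many hops and a high assumption count is a candidate "data derivative": the further it sits from the meter, the more of it is inference and the less of it is recorded fact.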

Now, don’t take this as instructions on how to pursue data science yourself or how to identify a data scientist—I said at least two axes.

But, the next time you decide to make a decision based upon data, remember it only tells you what the process designer thought of at the time of deployment and literally everything else is only theory.

You need data science to identify where the inferences are becoming extreme (and becoming derivative). Then either obtain integral data to fill those assumptions with facts, or have your data science team build multiple models that constantly challenge those over-burdened assumptions (derivatives) with competing, inference-laden theory.

And that is how you can start to actually listen to the data.

By Mark Beyer - VP distinguished analyst, Gartner