We’re moving very quickly into data sizes and result set complexities that exceed our ability as people to evaluate. That’s led to the rise of “big data” and data science. Our solution is an increasing reliance on machine learning methodologies to parse through what we can’t and reduce the complexity to something manageable. How we program that parsing is leading to a massive problem for data science. How we handle this and other challenges will determine the future of the profession.
The Problem of Bias
What’s signal and what’s noise or, asked another way, what’s important and what’s not? People develop heuristics to make that determination. Study decision making and it’s obvious that those heuristics are deeply biased. As a result we can make leaps of intuition and we can also fall for the simple tricks casinos play on us. Our biases blind us to certain information while causing us to rely more heavily on other types. The bottom line is that what people think is important is heavily influenced by our biases.
Machine learning in data science is represented as exactly the opposite. That’s the cornerstone claim, right? If machine learning’s capabilities are no different than our own heuristics then what’s the point? “Big Data” is supposed to be providing a different perspective; one rooted in data, able to see beyond our own biases and limitations.
Please don’t think I’m entering into the machine vs. human debate. I’m making the point that “Big Data” is supposed to be different from what we do on our own. Without that difference, it falls short of its potential. Is it still useful for simplifying large datasets? Yes, but isn’t that just efficiency, a way to speed up the process? That’s great but no major advance. We’ve been using technology to speed up processes for a while now. Will data science achieve its promise if it relies on the same heuristics we do? No more often than we would on our own, which means no.
But Data Scientists Are Trained To Detect Bias, Right?
We are. We look for bias in data sources and results. Any data source which excludes a portion of our target population is biased. When results mirror our assumptions too closely (100% of emails which contain the word ‘Viagra’ were reported as spam) they are examined for bias. When correlations lead to a flawed hypothesis (as the number of pirates has declined, average global temperatures have risen) we use further experimentation to refute the hypothesis and toss the correlation out as irrelevant.
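The pirates example can be sketched in a few lines. This is a minimal illustration, not a real analysis: both series are synthetic numbers I made up, generated independently of each other, and the helper names (pearson, differences) are my own. Two unrelated quantities that each drift over time correlate strongly, and the correlation vanishes once we compare their step-to-step changes instead of their levels.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def differences(series):
    """Step-to-step changes; a linear trend becomes a constant offset."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
steps = 400

# Synthetic, made-up series: neither one influences the other.
pirates = [1000 - t + rng.gauss(0, 5) for t in range(steps)]  # drifts down
temperature = [t + rng.gauss(0, 5) for t in range(steps)]     # drifts up

level_corr = pearson(pirates, temperature)
change_corr = pearson(differences(pirates), differences(temperature))

print(f"correlation of levels:  {level_corr:.2f}")
print(f"correlation of changes: {change_corr:.2f}")
```

The levels look almost perfectly (negatively) correlated only because both series share a time trend. Removing the trend is a cheap first experiment, and it is enough to make this particular hypothesis fall apart.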
The experiment is really our last line of defense against bias. Experiments have moved us beyond faulty assumptions about a flat earth and taken black holes out of Sci-Fi. In data science, a disciplined scientific method is moving businesses past bad assumptions and proving out new business models.
Anyone who’s done a data experiment knows most are time-consuming, labor-intensive efforts. So to achieve the promise of “Big Data”, which is the removal of our own biases to reveal genuine insights, we sacrifice the speed business craves.
Here’s the Problem
We are left with two options. Introduce our own biases and realize the same results we get on our own but faster. Remove our biases and make new discoveries slowly. Neither scenario is ideal. Again, I’m not disputing that progress has been made but what I’m saying is we’ve only achieved incremental progress while we’re promising a revolution.
Automation is the solution to this that I hear most often. We’re already automating the heuristic approach, which dramatically illustrates the problems bias presents. Automating the experimental approach leads to a completely different issue: which hypotheses do we test? Again we need to remove our own biases, so let’s automate the hypothesis discovery process. That leads to a lot more hypotheses discovered, which leads to a lot more experiments and slows things down even more. So let’s automate a process to prioritize the most important hypotheses first.
Here’s where our AI starts quoting Socrates. Is the pattern important because the programmer thinks it is, or does the programmer think a pattern is important because it is important? The first answer, the pattern is important because the programmer thinks it is, is obviously biased, which is what we’re trying to avoid. The second answer means the programmer cannot be trusted as the source of the heuristic that determines which patterns are important. The machine must therefore create its own heuristic by experimenting with every pattern it finds until it arrives at an unbiased one.
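To get a feel for what “experimenting with every pattern” costs, here is a rough sketch under simplifying assumptions of my own: a “pattern” is just a combination of variables, every combination becomes one experiment, and each experiment uses a conventional 5% significance threshold. The function names are hypothetical and the counting is deliberately naive.

```python
from math import comb


def candidate_hypotheses(num_variables, max_order=2):
    """Count variable combinations of size 2..max_order, one experiment each."""
    return sum(comb(num_variables, k) for k in range(2, max_order + 1))


def expected_false_positives(num_tests, alpha=0.05):
    """Expected tests that pass by chance alone if no real effect exists."""
    return num_tests * alpha


for n in (10, 100, 1000):
    pairs = candidate_hypotheses(n)
    chance_hits = expected_false_positives(pairs)
    print(f"{n} variables: {pairs} pairwise hypotheses, "
          f"about {chance_hits:.0f} expected to pass by chance")
```

With 1,000 variables there are already 499,500 pairwise hypotheses, before we consider three-way interactions. Even if none of them were real, thousands would clear a 5% threshold by chance, and each of those would demand a follow-up experiment.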
What defines experimental success? Is a business model successful if it leads to short term profits at the expense of longer term success? Is a business model that pays tomorrow at the cost of today’s success better? Is a business model only a success if it works both in the short and the long term? Is success defined by revenue, margin, business value or some combination?
That’s the rabbit hole. Automation, like every other solution I’ve heard presented, breaks down to some level of bias, which skews the results towards an unacceptably high level of failure.
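The questions about experimental success can be made concrete with a toy sketch. Everything in it is hypothetical: the three business models, their profit figures and the three definitions of success are invented purely for illustration. The point is only that the same results crown a different winner depending on which definition someone chose to encode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    short_term_profit: float
    long_term_profit: float


# Hypothetical experiment results; the numbers are made up.
outcomes = [
    Outcome("quick win", short_term_profit=9.0, long_term_profit=2.0),
    Outcome("slow burn", short_term_profit=1.0, long_term_profit=12.0),
    Outcome("balanced", short_term_profit=5.0, long_term_profit=6.0),
]

# Three candidate definitions of "success", each a scoring function.
success_definitions = {
    "short term only": lambda o: o.short_term_profit,
    "long term only": lambda o: o.long_term_profit,
    "must work in both": lambda o: min(o.short_term_profit, o.long_term_profit),
}


def winner(score):
    """Name of the outcome that maximizes the given scoring function."""
    return max(outcomes, key=score).name


winners = {label: winner(score) for label, score in success_definitions.items()}
for label, name in winners.items():
    print(f"{label}: {name}")
# short term only: quick win
# long term only: slow burn
# must work in both: balanced
```

Three definitions, three winners. Whoever picks the scoring function has put their bias back into the process, no matter how automated everything downstream is.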
Why Am I Tilting At Windmills?
If we don’t make some progress towards answering the big questions, this becomes just another IT fad. As data scientists, we have an opportunity to take what we’ve started and build a discipline with legs. Our education and approach link us to academia, and our value links us to business. That’s a rare pipeline. Showing the business value in decreasing the bias in reinforcement learning and unsupervised learning, in order to improve the accuracy of prescriptive and predictive analytics, is a big part of that. I think it’s the first big question we face and a make-or-break moment for our profession. We can take the hard road and work the solution or we can lower expectations.
I’m advocating for the hard road while I’m seeing a lot of colleagues working to lower expectations. I’m all for being realistic, but a lot of the initial projections for data science are realistic. Data-driven business model generation, real-time marketing personalization, real-time pricing, demand forecasting, decision modeling, etc. are all attainable goals. I don’t see how backing away from what’s possible because we’ve encountered problems is part of the scientific or engineering approach. We run towards problems, not away from them, right?
What Google has done with data, both their approach and their success, drives my sentiments. They’ve been faced with the choice of lowering expectations or working on complex problems throughout their time in business. They have chosen to work the hard problems around data collection, analysis and presentation. Typically they’ve flown in the face of those who don’t understand why they’re tilting at windmills like self-driving cars, drones, augmented reality and many others. The results have built one of the most successful companies of our generation. Competitors who have taken to lowering expectations, like Bing or Yahoo, have been significantly less successful.
In the current business climate, the problems we walk away from are the opportunities others seize. Choosing to work the problem is deciding to take our opportunity. So here’s data science’s moment: rise to the challenges or leave them for someone else.
But that’s just my bias. Yours is the one that counts.