I recently gave a main-stage talk at the Data Innovation Summit about the emerging discipline of data science. It was an outstanding event, with delegates from many countries spanning multiple industries. Hadoop and the idea of a “data lake” received considerable attention, though not without being contrasted with an existing platform built to extract deep insights from data: the proverbial data warehouse.
The problem is that many organizations have been developing their enterprise data warehouse for years without seeing the returns once promised. Now there’s a sense of resentment in this group, since modern data manipulation technology has leapfrogged past them without a second look. Understandably, data warehouse administrators and vendors will claim that all sorts of cool things are possible on their platform, but the issue isn’t whether something is possible. The principle of computational equivalence says that you could use a pile of rocks as your big data platform if you wanted to. That doesn’t mean it’s a good idea.
The centerpiece of data warehouses has traditionally been a big relational database. The primary requirement of such a database is that all information be represented as values in tables, manipulated by way of relational algebra (i.e. SQL queries). In addition, a longtime distinguishing feature of relational databases has been ACID compliance, which gives strong integrity guarantees when data is modified. But because the main purpose of a data warehouse is to store and analyze data – as opposed to modify it – the added value of ACID transactions is marginal at best. Nor is the ability to work with big “tables” a unique selling point of relational databases. Query engines like Drill, Impala, Presto, Spark, and Hive run on top of the Hadoop file system and can expose data as tables queryable with ANSI SQL, with ODBC drivers that integrate with existing BI tools. Industry-standard TPC-DS benchmarks have shown these engines to be at the very least comparable to, and in some cases several times faster than, their proprietary warehouse counterparts. To add insult to injury, the majority of enterprise data is not even generated in the form of a table to begin with:
- Arbitrarily serialized event streams
- JSON messages from phone applications or websites
- XML documents from line of business applications
- Graphs of customer, product, and transaction relations
- Free text written by your support center or customers
- Geospatial data describing operations
- Time series data from servers or IoT devices
- Binary data from images or audio
Organizations try their best to ETL what they can into the data warehouse, but we know it’s not pretty. By contrast, all of the above data types can be trivially ingested and manipulated using the standard tools of a Hadoop platform. Not only that, but machine learning with state-of-the-art tools like XGBoost, H2O, and other frameworks can be performed without ever moving the data out to a separate “analytics environment”.
Now, to be fair, most database vendors – Teradata, Oracle, IBM, and HP among them – have begun bundling older versions of Hadoop with their standard offerings and even marketing it as a primary way to handle big data. But Hadoop is free and open source, so why would you pay a warehouse vendor for it? An administrator might say that storing, computing, and analyzing is one thing, but that the unique selling point of a warehouse is the surrounding suite of proprietary tools and its ecosystem. However, that implies that the added value of the proprietary tools on their own justifies their astronomical price. No, that’s absurd. A vendor might say they are “enterprise grade” and have “verified solutions” in your industry vertical, but what they’re really saying is “your competitors bought our machine at one point so you should too”. That’s hardly a convincing argument if you’re looking to gain an edge in your industry. Moreover, niche Hadoop vendors provide outstanding support for Apache tools in the same way that Red Hat supports Linux tools.
Once other arguments for buying a warehouse are exhausted, out comes,
“You’re oversimplifying things!! The complex reality of enterprise business rules and governance can’t be reduced to open source tools that some hackers put together.”
I’m not really sure what to say when that one comes up, so I’ll just leave it hanging…
It would be a poor strategy for next-gen vendors to begin a sales motion by trying to displace data warehouses that have cost companies many millions of dollars to put in place. As Gartner says, data warehouses will undoubtedly be around for a long time. That being said, is the traditional data warehouse going to be the primary repository for data storage, mining, analysis, and machine learning of the future? The answer is No.