Hadoop, like my son, is now over 10 years old. When my son was born I had hair, the Blackberry was cool, and the iPhone would not be available in stores for another two months. When Hadoop was born, Hitachi had just released the first 1 terabyte drive, but the average desktop hard drive was 30-100GB, 80 IOPS per spindle was pretty good, and, relative to today, IP and storage area networks were the pseudo-straw I use to stir my coffee. In 2007 analytics had a problem: how do I address large datasets with massively parallel computational tools that generate large volumes of IO when the data sits on a shared array at the far end of that coffee straw? You could not. In 2007, working at a services company, I can remember the childless, bigdatabeard-less me running hands through my ample hair asking, “What the hell is this application?”, as a customer’s Hadoop-based learning system repeatedly fell over, nearly taking a multi-customer array with it. Now it is 2018, the bigdatabeard is strong with this one, and the only things flashier than my dome are 3TB SSDs.

Hadoop Distributed File System (HDFS) and its predecessor, Google File System (GFS), are by their nature software-defined storage (SDS) systems. HDFS distributes I/O by spreading blocks of data, and replicas of those blocks, over multiple disks and nodes to provide fault tolerance and IO performance. Similarly, Splunk replicates index data across multiple file paths on distributed hosts. In a similar vein, Dell EMC ScaleIO and VMware vSAN use the storage and compute resources of commodity hardware to create virtual pools of block storage, but with enterprise features and integration with enterprise IT tools. If our work with Splunk is any indication, not only are modern SDS systems highly performant, but there is a great deal of value in being able to provide elasticity, scalability, and enterprise tools to analytics infrastructure. Additionally, the same platform can be leveraged for multiple data sources and analytical tools: Hadoop, ELK, RDBMS, and/or Splunk on the same shared infrastructure platform.

The best part of my job is hearing stories and feedback from practitioners who are doing cool and innovative things in the wild with technology. One of the most prevalent analytics workloads for ScaleIO and vSAN is Splunk. Both were validated in conjunction with Splunk in 2016 (see below). I recently met with a Splunk administrator who is converting their cyber-security platform from legacy bare metal, DAS-based infrastructure to vSAN SDS on VxRail. The only complaint was, in their own words, a first world problem: the system is so fast they are having trouble hitting F12 in time on boot. Another colleague tells me their customer, a large national financial institution, is running ELK on ScaleIO and showing phenomenal results. Finally, in one of the most interesting developments, a healthcare provider is apparently running Hadoop on VxRail in production using vSAN fault tolerance (FTT=1) and no block replication in Hadoop. The feedback from their team was that everything is going great. This last use case is particularly appealing because not only is the customer able to leverage VMware for Hadoop, but the raw storage consumed is roughly 33% less than with traditional 3 copy HDFS (two copies of the data instead of three, assuming FTT=1 is implemented as mirroring). It should be noted that this is not a tested configuration in the VMware vSAN whitepaper (below), and it remains to be seen what performance or other penalties may offset the gains in storage efficiency. The only performance data I have is anecdotal. What does their VMware SE or their Hadoop distribution vendor say about that, if anything? I don’t know. Time will tell.
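For anyone who wants to sanity-check the roughly 33% figure, here is a minimal back-of-the-envelope sketch. It assumes that FTT=1 is implemented as mirroring (two full copies of the data) and that HDFS would otherwise run with its usual three copies. The 100 TB dataset size and the function names are hypothetical, chosen only for illustration, and the sketch ignores witness components, slack space, and any other overhead.

```python
def raw_capacity_tb(usable_tb: float, copies: int) -> float:
    """Raw capacity needed to store usable_tb of data as `copies` full copies."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return usable_tb * copies


def savings_fraction(baseline_copies: int, new_copies: int) -> float:
    """Fraction of raw capacity saved by moving from baseline_copies to new_copies."""
    return 1 - new_copies / baseline_copies


usable_tb = 100.0  # hypothetical dataset size, not a figure from any customer

# Traditional HDFS: three copies of every block.
hdfs_raw = raw_capacity_tb(usable_tb, copies=3)

# Assumption: vSAN FTT=1 as a mirror (two copies), with Hadoop replication set to 1.
vsan_raw = raw_capacity_tb(usable_tb, copies=2)

saving = savings_fraction(baseline_copies=3, new_copies=2)

print(f"3 copy HDFS raw capacity:  {hdfs_raw:.0f} TB")
print(f"FTT=1 mirror raw capacity: {vsan_raw:.0f} TB")
print(f"Raw capacity saved:        {saving:.1%}")
```

The saving depends only on the ratio of copies, so it holds for any dataset size under these assumptions. It says nothing about performance, which is the open question.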

It is arguably still early days, but I am cautiously optimistic and excited by the prospects and potential performance of enterprise SDS for big data and analytics. Most enterprises have spent more than a decade seeking to consolidate, virtualize, and gain efficiency and elasticity. DAS-based big data and analytics has historically, and in many respects for good reasons, run counter to that effort. However, the vision and promise of use-case-specific SDS systems like HDFS have now been embraced by general purpose, enterprise-level SDS like ScaleIO and vSAN. I have learned a lot from my son over the last 10 years, and I have definitely had to adapt, but at the same time I would like to think he still has more than a few things to gain from me, old, bald, and bigdatabearded that I am.

Dell EMC Splunk Partner Page (Splunk Validations)

Hadoop on vSAN