Isilon and Hadoop – The Enterprise Data Lake Substrate

The value proposition of Hadoop for most organizations is that it provides a low-cost way to keep large volumes of data online and creates a framework for analyzing that data to uncover new value drivers in their business. However, Hadoop is not the silver bullet for Big Data; Hadoop is a storage substrate and a set of really handy tools that allow us to ingest, process, exploit, disseminate, and archive data very effectively. Its value, whether deployed on Isilon or on physical servers, lies more in its adoption as a standard framework for analyzing unstructured data and in how many other really interesting and useful data technologies integrate with that framework. That’s the power of Hadoop: the ecosystem of knowledge and integrated tools. Hadoop has a ton of promise because of the scale of data we might collect and analyze in a cost-effective manner, but for many companies, Hadoop environments lack many of the critical capabilities that enterprise IT has come to expect from every other application stack in the data center.

For the last ten years, many IT professionals have operated with a few enterprise IT standards for designing infrastructure to support critical business applications:

  • Disaster Recovery and High Availability
    • Applications and Data must be protected from accidental deletion, application corruption, and site disasters so that business can continue, even when infrastructure fails, which it does regularly.  This can include backups, replication, and off-site retention.
  • Security and Compliance
    • Applications and data must be secured from unauthorized access, both internal and external. Data must be retained to meet regulatory compliance, and that retention must be verifiable to the regulating bodies.
  • Virtualize-First
    • Applications and data must be virtualized, not only for efficiency, but for mobility and scalability as well.
  • Consolidate. Consolidate. Consolidate.
    • Applications and data must be consolidated to run on as small an infrastructure as is responsible, so as to avoid unnecessary expenses for energy, real estate, maintenance, and licensing.
  • Scalability and Efficiency
    • Applications and data must reside on infrastructure that is easily scalable to meet the increasing demands on IT, without increasing complexity.

However, these requirements are often overlooked when people start talking about Hadoop. While I understand that the business transformation that analytics powered by Hadoop might deliver is incredibly exciting, I don’t believe the standards of enterprise IT and the promises of Big Data should be mutually exclusive. Let’s look at those same key tenets IT professionals have been using for many years and understand the differences between how Hadoop is traditionally deployed with direct-attached storage (DAS) in servers and how Isilon can be leveraged to better align with enterprise IT standards:

  • Disaster Recovery and High Availability
    • Data Protection – Hadoop does a 3X mirror for data protection and has no replication natively…Isilon supports snapshots, clones, and replication natively.
  • Security and Compliance
    • Security – Out of the box, Hadoop does not enforce Kerberos authentication; it assumes all members of the domain it is in are trusted and accepts the identity each client presents…Isilon supports integrating with AD or LDAP and gives you the ability to safely segment access based on policy.
    • Compliance – Hadoop has no native encryption or enforced retention…Isilon supports Self Encrypting Drives across our entire range of nodes and has compliant retention leveraging WORM.
  • Virtualize-First
    • Multi-Distribution Support – Each physical HDFS cluster can only support one distribution of Hadoop…Isilon allows you to commingle physical and virtual versions of any Apache standards-based distributions of Hadoop that you like.
    • Virtual Hadoop – Most Hadoop implementations leverage many physical servers…Isilon provides a Hadoop Starter Kit that can leverage VMware vCenter 5.5+ with the Big Data Extensions for a wizard-based approach to deploying Hadoop Compute-Only resources on demand.
  • Consolidate. Consolidate. Consolidate.
    • Data In-Place Analytics – For many use cases, Hadoop requires a landing zone where data arrives before “distcp” or something like Flume or Sqoop ingests it into the cluster…Isilon is true multiprotocol, so we let data land on Isilon via any standard NAS or Object protocol and bring Hadoop to the data by providing HDFS as a protocol (think about trying to push 100TB across the WAN and waiting for it to migrate before analysis can start…we do it in place).
  • Scalability and Efficiency
    • Dedupe – Hadoop natively 3X mirrors files in a cluster, meaning 33% storage efficiency…Isilon is typically 80% efficient and we have sub-file level deduplication for further efficiency gains.
    • Scale Compute and Storage Independently – Hadoop marries the storage with the compute, so if you need more space, you have to pay for more CPU that may go unused, and if you need more compute, you end up with lots of overhead storage capacity…Isilon allows you to scale compute as needed and Isilon storage as needed, aligning your costs with your requirements.
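
For anyone who wants to check the storage efficiency math in the Dedupe bullet above, here is a minimal sketch. It only uses the two figures quoted in this post (a 3X mirror, which works out to roughly 33% efficiency, and Isilon at a typical 80%); the 100 TB data set size and the function name are hypothetical and just there for illustration.

```python
def raw_capacity_tb(usable_tb: float, efficiency: float) -> float:
    """Raw disk capacity needed to hold usable_tb at a given storage efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return usable_tb / efficiency


HDFS_3X_EFFICIENCY = 1 / 3  # three full copies of every block
ISILON_EFFICIENCY = 0.80    # "typically 80% efficient", as quoted in the post

usable_tb = 100.0  # hypothetical data set size, not a figure from the post

das_raw = raw_capacity_tb(usable_tb, HDFS_3X_EFFICIENCY)
isilon_raw = raw_capacity_tb(usable_tb, ISILON_EFFICIENCY)

print(f"3X-mirrored DAS: {das_raw:.0f} TB raw")
print(f"Isilon at 80%:   {isilon_raw:.0f} TB raw")
print(f"Raw capacity avoided: {das_raw - isilon_raw:.0f} TB")
```

Under those assumptions, the same 100 TB of data needs about 300 TB of raw disk with a 3X mirror versus about 125 TB at 80% efficiency, before any SmartDedupe savings are counted.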

Say this three times fast –> Hadoop-Dedupe-Dedupe-Dedupe

This melodic incantation is the distillation of Isilon’s value proposition for Hadoop done by some witty Isilon employee…I cannot imagine why it has not caught on yet. This catchy tune tells us that Hadoop is a critical technology framework in the new world of Big Data and you should absolutely be leveraging Isilon for your Enterprise Data Lake because we:

  • Hadoop – with HDFS on Isilon, we dedupe storage requirements by removing the 3X mirror on standard HDFS deployments because Isilon is 80% efficient at protecting and storing data.
  • Dedupe – applying Isilon’s SmartDedupe can further dedupe data on Isilon, making HDFS storage even more efficient.
  • Dedupe – by using HDFS on Isilon, we remove the need to have a separate landing zone for data…we bring the analytics to the data, not the other way around.
  • Dedupe – by leveraging vHadoop, we dedupe the number of servers required to run Hadoop.
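
To put a rough number on the landing zone point (the "push 100TB across the WAN" scenario mentioned earlier), here is a back-of-the-envelope sketch. The link speeds are hypothetical examples, not figures from any customer environment, and the calculation assumes the link is fully dedicated to the copy with no protocol overhead, which is optimistic.

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Days needed to move data_tb (decimal terabytes) over a link_gbps link."""
    if link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("link_gbps must be positive and utilization in (0, 1]")
    bits = data_tb * 1e12 * 8                              # terabytes to bits
    seconds = bits / (link_gbps * 1e9 * utilization)       # bits / (bits per second)
    return seconds / 86_400                                # seconds to days


# Hypothetical link speeds, chosen only to show the scale of the wait.
for gbps in (1, 10):
    print(f"100 TB over {gbps} Gbps: {transfer_days(100, gbps):.1f} days")
```

Even in this best case, the copy has to finish before the first query can run; analyzing the data where it already sits skips that wait entirely.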

For the less whimsical, more detail-oriented types, this side-by-side comparison of Hadoop on DAS versus HDFS on Isilon should be more to your liking:

Screen Shot 2014-10-13 at 2.26.08 PM.png

What many of my customers are trying to do is build the foundation for a Data Lake: a place where internally sourced data can co-reside with interesting data sets sourced externally and be co-processed to give our analytical queries greater scope and context. Having a Data Lake that is capable of granting access to data via multiple standard protocols and access methods is key, but so is being able to derive value from the data by leveraging the powerful Hadoop ecosystem. A Data Lake has to meet enterprise IT standards and it has to be Hadoop friendly; with Isilon, you can have it all!

Why Isilon is awesome for Hadoop.

Here at Isilon, we get involved in a lot of these conversations around Data Lakes, and customers generally like the idea that we can provide HDFS protocol access on Isilon. The fact that Isilon can play nicely with physical, virtual, and multi-distribution environments for Hadoop, without the need for the traditional dedicated server stack being set up in a Hadoop silo, aligns nicely with their goals of efficiency and consolidation. We have also had quite a bit of success and interest from customers in highly regulated industries because of the way we bring enterprise features like data protection, performance tiering, encryption, compliant retention, dedupe, secure authentication, and highly efficient storage to Hadoop environments in ways no other solution has figured out.

Screen Shot 2014-10-09 at 11.32.51 PM.png

So, we may be a little biased, but we believe leveraging Isilon as the storage technology to support and deliver Hadoop functionality is pretty slick, especially given that we can be so much more than just an HDFS storage environment by delivering next generation access to data via protocols like NFS, CIFS, FTP, HTTP and REST.   Surprisingly to many customers, we bring all this value to Hadoop while offering a significantly lower TCO than traditional approaches.

Screen Shot 2014-10-13 at 2.14.10 PM.png

Did I mention that we are just as performance oriented as the DAS approach to Hadoop? By bringing Hadoop to the data and not bringing data to Hadoop, we can typically start processing queries sooner, and we can also process them faster. Check out the example below that outlines the entire Hadoop job cycle from data ingest to viewing results.

Screen Shot 2014-10-13 at 2.52.06 PM.png

Now, from time to time, I encounter the Hadoop expert who will argue mightily that disk locality matters in Hadoop and that our approach negates it. The argument here is that part of Hadoop’s performance is derived from having blocks of data stored on hard drives in the same servers that run the tasks associated with MapReduce jobs. While we do implement HDFS differently, I generally bring this commentary back to a couple of facts:

  1. MapReduce is a batch, distributed job typically running on servers connected via 1G star networks. Batch and 1G networks are the keys here…these jobs are not high-speed, low-latency jobs because the infrastructure is not networked that way. The point is moot, as there are better ways to process data at speed…Hadoop is Big Data, not fast data.
  2. Facebook found that only 34% of tasks run on the same node that has the input data.  Disk-locality was not a big deal for Facebook, so is it for you? See the article here.
  3. Today, a single non-blocking 10 Gbps switch port (up to 2500 MB/sec full duplex) can provide more bandwidth than a typical disk subsystem with 12 disks (360 – 1200 MB/sec).
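
The bandwidth comparison in the third point is just unit conversion, so here is a small sketch that reproduces it. The per-disk throughput values of 30 and 100 MB/sec are not stated in this post; they are what the quoted 360 – 1200 MB/sec range implies when divided across 12 disks, so treat them as an assumption. The function names are made up for illustration.

```python
def port_mb_per_sec(gbps: float, full_duplex: bool = True) -> float:
    """Convert a switch port speed in gigabits/sec to megabytes/sec (decimal units)."""
    one_way = gbps * 1000 / 8  # gigabits -> megabits -> megabytes
    return one_way * 2 if full_duplex else one_way


def disk_subsystem_mb_per_sec(disks: int, per_disk_mb: float) -> float:
    """Aggregate throughput of a disk subsystem, assuming it scales linearly per disk."""
    return disks * per_disk_mb


port_full = port_mb_per_sec(10)                     # both directions combined
port_one_way = port_mb_per_sec(10, full_duplex=False)
disks_low = disk_subsystem_mb_per_sec(12, 30)       # assumed slow per-disk rate
disks_high = disk_subsystem_mb_per_sec(12, 100)     # assumed fast per-disk rate

print(f"10 Gbps port: {port_one_way:.0f} MB/sec one way, {port_full:.0f} MB/sec full duplex")
print(f"12-disk subsystem: {disks_low:.0f} to {disks_high:.0f} MB/sec")
```

Note that even a single direction of the port sits at or above the top of the disk range under these assumptions, which is why the network is not the obvious bottleneck it once was.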

While I fully admit that I am a celebrity endorser of Isilon as a Data Lake platform who is paid to talk about how great Isilon is for Big Data environments, let’s just recap a few things that I think we have proven here:

  1. Isilon delivers enterprise-class IT standards in our deployment of Hadoop beyond the traditional approaches in the Hadoop ecosystem.
  2. Isilon provides an incredibly flexible storage substrate for an Enterprise Data Lake by providing multi-protocol access to data, including Hadoop.
  3. Isilon makes Hadoop cheaper, faster, and easier to deploy in the enterprise.

Sounds pretty great, right?  Say it again with me…Hadoop, Dedupe, Dedupe, Dedupe!