Is Big Data the new bacon? It certainly appears that way these days with the amount of excitement and energy swirling around these ideas and associated technologies. I can smell the sweet, smoky excitement every time I talk with customers about Big Data, and one of the most tantalizing areas in Big Data right now is machine-generated data. Think about how connected everything we touch is today…it’s not just the cell phone in your pocket; it’s every appliance in your home, it’s smart cars, intelligent jet engines, and every sensor you can imagine creating data constantly while we stroll through life. This is what we lovingly refer to as the “Internet of Things.” These devices along with servers, computers, or anything with a processor are creating logs. The logs might answer an innumerable number of questions if we could process them with context and correlate them with other data sources…sounds as tasty as a smoky piece of bacon, right?
When talking about machine-generated data (MGD for short…and no, that is not a reference to Miller Genuine Draft, but this article will be more enjoyable if paired with a few), I typically run into two key technologies that customers are leveraging in some capacity: Splunk and Hadoop. However, they are not typically leveraging them together, and we get a lot of questions as to why you would use one versus the other, both separately, or both together. So, having just spent a week at the annual Splunk user event, .conf, I am excited to share some of the more interesting conversation threads that seemed to resonate with folks when talking about Splunk, Hadoop, and Isilon:
- What is Splunk and why is it used?
- Hadoop versus Splunk – should it really be a fight?
- How Isilon’s Data Lake Strategy is absolutely relevant to Splunk.
What is Splunk and why is it used?
Splunk’s mission in this world is to be the company that makes machine data analytics accessible to everyone. They deliver on this mission with a complete platform, from infrastructure to end-user experience, that simplifies the deployment of a solution for deriving value from massive amounts of unstructured data and MGD. To set the table for how Splunk works, I’d like to share one of my favorite stories from the .conf event.
New York Air Brake is the market leader in locomotive braking and driver automation with more than 80% market share in that segment. Rather than resting on their laurels as the world leader in what is basically a manufacturing business, they created a division in their company whose sole mission was to leverage technology on trains to make them safer and more efficient. Their flagship product is an intelligent driver-assistance platform called Locomotive Engineer Assist/Display Event Recorder (LEADER) that will have a $1B ROI at a single US railroad company…that’s a “B” for billion! They ported the collective knowledge of their engineers and physicists on how to most efficiently drive a train into intelligent algorithms that could be run against the thousands of data points being created by sensors of all kinds along the length of the train. By leveraging their collective wisdom, the known physics of train motion and Splunk as the platform for handling the processing of all that machine-generated data, trains can travel down the track more efficiently, translating to 10-20% cost savings in lowered fuel consumption. When you consider that a major railroad company, like say Norfolk Southern, can consume more fuel annually than the US Navy, it’s not hard to imagine saving a billion dollars at the pump by driving smarter…and Splunk is the framework that enabled this sort of actionable intelligence on MGD.
Not only did New York Air Brake give us a great story of how Splunk is being used, but Greg Hrebek, their Director of Engineering, gave the best (and maybe most comical) description of Splunk I have seen to date:
The Splunk platform consists of an infrastructure layer that can ingest and index data from almost any source and an application framework on top that can query that data either through a native query language or via a robust set of applications designed to use the Splunk indexes as the information substrate. Think of it this way: Splunk can ingest data from multiple sources with the only requirement for data structure being that it must either be text data or data that is easily translated to text (think log files or database files with time-stamped transactions). Once the data is ingested and indexed, queries can be executed against the data through the native Search Processing Language (SPL, Splunk’s proprietary query language), or the data can be used by one of the growing number of applications designed to leverage the Splunk indexes as a data substrate. These applications might be something like a dashboard for marketing teams looking to understand their promotional effectiveness by correlating web activity monitoring logs with ad placement information to determine which ads are having the highest conversion rates to an order.
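To make the marketing example a bit more concrete, here is a minimal sketch in plain Python of the kind of correlation such a dashboard might perform. It is not Splunk code or SPL; the event fields (`session`, `action`, `ad_id`), the sample records, and the `conversion_rates` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical, already-parsed web activity events (illustrative data only).
web_events = [
    {"time": "2014-10-06T10:00:00", "session": "s1", "action": "ad_click", "ad_id": "A"},
    {"time": "2014-10-06T10:05:00", "session": "s1", "action": "order"},
    {"time": "2014-10-06T10:07:00", "session": "s2", "action": "ad_click", "ad_id": "A"},
    {"time": "2014-10-06T10:09:00", "session": "s3", "action": "ad_click", "ad_id": "B"},
    {"time": "2014-10-06T10:15:00", "session": "s3", "action": "order"},
]

# Hypothetical ad placement lookup, standing in for a second data source.
ad_placements = {"A": "homepage banner", "B": "search sidebar"}


def conversion_rates(events, placements):
    """Return the share of ad-click sessions that later placed an order, per placement."""
    session_ad = {}   # session -> most recently clicked ad
    ordered = set()   # sessions that ordered after clicking an ad
    for event in sorted(events, key=lambda e: e["time"]):
        if event["action"] == "ad_click":
            session_ad[event["session"]] = event["ad_id"]
        elif event["action"] == "order" and event["session"] in session_ad:
            ordered.add(event["session"])

    clicks = defaultdict(int)
    orders = defaultdict(int)
    for session, ad in session_ad.items():
        clicks[ad] += 1
        if session in ordered:
            orders[ad] += 1

    # Fall back to the raw ad id when no placement description is known.
    return {placements.get(ad, ad): orders[ad] / clicks[ad] for ad in clicks}


rates = conversion_rates(web_events, ad_placements)
```

With the sample data above, ad A converts in one of its two sessions and ad B in its only session. A real deployment would run this sort of correlation as a query over indexed data instead of over an in-memory list.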
A deployment for Splunk consists of the following key components:
- Forwarders – these are either applications or physical devices that send data from a source to an indexer for processing.
- Indexers – these are the heavy lifters in the Splunk world because they index the data and manage laying it out in storage. Indexes live in different tiers based on space available in each tier (tier diagram provided below…for those visual folks).
- Hot – this is where all incoming data is written, so the most recent data lives here.
- Warm – the next tier down; read only and likely still searched.
- Cold – rarely searched data that has aged or been archived (rolled) to this bucket. While read only and still searchable, this is considered the archive tier.
- Frozen – this is data that is pushed to dead media like tape or deleted. If the data was not deleted completely, a thawing process can push it back into higher tier buckets.
- Search Head – this is where queries are executed, either written directly in Search Processing Language (SPL) or issued by one of the many Splunk applications. Multiple search heads can be deployed for HA environments and distributed geographically.
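The bucket tiers described in the list above can be sketched as a simple lifecycle function. This is a simplified model in plain Python, not Splunk's implementation: it rolls data on age alone, and the thresholds and function names are hypothetical. Splunk's actual rolling criteria are configurable per index and are not limited to age.

```python
from datetime import datetime, timedelta

# Hypothetical age thresholds, chosen only for illustration.
WARM_AFTER = timedelta(days=1)
COLD_AFTER = timedelta(days=30)
FROZEN_AFTER = timedelta(days=365)


def bucket_tier(newest_event, now):
    """Return the tier a bucket would sit in, based only on the age of its newest event."""
    age = now - newest_event
    if age >= FROZEN_AFTER:
        return "frozen"
    if age >= COLD_AFTER:
        return "cold"
    if age >= WARM_AFTER:
        return "warm"
    return "hot"


def searchable(tier):
    """Hot, warm, and cold buckets remain searchable; frozen data must be thawed first."""
    return tier in {"hot", "warm", "cold"}


now = datetime(2014, 10, 10)
tiers = [
    bucket_tier(now - timedelta(hours=2), now),
    bucket_tier(now - timedelta(days=7), now),
    bucket_tier(now - timedelta(days=90), now),
    bucket_tier(now - timedelta(days=400), now),
]
```

The point of the sketch is the one-way flow: data only moves down the tiers as it ages, and only a deliberate thaw brings frozen data back into a searchable state.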
While Splunk has its roots in MGD and IT teams have been deploying it to solve real IT challenges in log management, the application of their technology to broader analytics use cases has gained a lot of traction in the enterprise. To meet these broader analytical needs and use cases, Splunk has continued to invest in integrations and vendor partnerships that make using Splunk for business analytics a lot easier. One of those innovations was DB Connect. This tool allows Splunk to pull structured data from RDBMS systems and APIs like Salesforce.com and add greater context to queries. Additionally, end users who may not be Splunk Ninjas (that is the proper moniker for an expert with Splunk) can take advantage of tools they are already comfortable with like Excel, Tableau, SAS and others to leverage data models from Splunk for their analytics and visualization tasks thanks to the ODBC Driver from Splunk.
These two critical pieces of functionality expand the usefulness of Splunk as a true enterprise-class analytics platform and broaden its relevance in any organization…and if organizations already have Splunk users or Ninjas using Splunk in IT, then the skills necessary to extend this functionality throughout the enterprise are less about technology skills and more about creativity and business/domain experience.
Hadoop versus Splunk – should it really be a fight?
The value proposition of Hadoop for most organizations is that it provides a low-cost way to keep large volumes of data online and creates a framework for analyzing that data to uncover new value drivers in their business. In the MGD context, Hadoop has a ton of promise because of the scale of data we might collect and analyze in a cost-effective manner, but for a lot of companies, it is a science project at this point. These science projects are led by “Data Scientists” who bring together a unique set of skills that combines statistics expertise and IT/hacking skills with strong business acumen. Sure, the folks at Google and Yahoo! seem to have these science projects converted to an art exhibit at this stage, but most of the enterprise customers I work with don’t have the people, process or business model in place to support that sort of data maturity. The fact remains that Hadoop is an open-source project that is gaining maturity as a framework, but is not truly an application for querying data. By that I mean, you still need some sort of application or custom scripting to execute any sort of query, visualization, correlation, or algorithm execution for data stored in Hadoop. Sure, there are a number of vendors out there that have applications and processes (Splunk introduced Hunk for this exact use case, but we will talk more about that in another article) that can deliver value here, but it often feels like a disjointed process of integrating technologies that is more complicated than it should be; or at least more complicated than many enterprise customers have the time/energy/people to invest.
This challenge is, in my humble opinion, the biggest reason why Hadoop is still a science project for most folks; getting the data into Hadoop is the easy part, but getting value out remains rather imposing for many enterprise customers. It is this challenge that I think is prompting such rapid adoption of Splunk and its recent ascent from being an IT tool just for logs and MGD to a true analytics platform with broad applicability in the enterprise.
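For a feel of what “custom scripting” means in practice, here is a minimal map/reduce-style sketch in plain Python that counts HTTP status codes in log lines. It only mimics the map and reduce steps inside a single process and does not use any Hadoop API; the log line layout (`<timestamp> <host> <status>`), the sample lines, and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def mapper(line):
    """Emit (status, 1) for a log line shaped like '<timestamp> <host> <status>'."""
    parts = line.split()
    if len(parts) == 3:  # silently skip lines that do not match the assumed layout
        yield parts[2], 1


def reducer(pairs):
    """Sum the counts emitted by the mapper for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Hypothetical sample input, standing in for files stored in a cluster.
log_lines = [
    "2014-10-06T10:00:00 web01 200",
    "2014-10-06T10:00:01 web02 500",
    "2014-10-06T10:00:02 web01 200",
    "malformed line",
]

status_counts = reducer(chain.from_iterable(mapper(line) for line in log_lines))
```

Even this trivial question requires someone to decide on a parsing rule, handle bad records, and write the aggregation by hand, which is the effort the paragraph above is pointing at.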
How Isilon’s Data Lake Strategy is absolutely relevant to Splunk.
While I have yet to meet any Splunk users that have environments that eclipse the multi-PB range, the fact that Isilon can deliver a single volume, single file system for Splunk index buckets of up to 30PB today is pretty awesome. Now, many customers will only leverage Isilon as the storage substrate for their Cold buckets and that is a magnificent choice because Isilon is a cost-effective archive tier of storage when leveraging Isilon NL-series nodes.
EMC has even built reference architectures for Splunk solutions that leverage Isilon for Cold buckets and then either XtremIO’s All-Flash Appliance or ScaleIO’s distributed virtual SAN technologies for the Hot and/or Warm buckets. However, Isilon can certainly be sized with high-performance X-series or S-series nodes to meet the performance needs of most any Splunk environment and actually delivers a pretty nice benefit when used for the Hot and Warm buckets as well…no need for backups because we are a file system with mature snapshot capabilities. Based on the current data protection guidance from Splunk, the only way to really back up a Hot bucket is with snapshots, unless you are willing to roll that bucket to Warm and accept the performance penalty.
We bring the same enterprise-class features to Splunk that have made Isilon the analyst’s top pick for Scale-Out NAS storage…data protection, performance tiering, encryption, compliant retention, dedupe, and secure authentication. Isilon is an absolutely great fit for Splunk storage infrastructure and is getting a lot of traction with customers as they look to consolidate infrastructure and data islands…another reason we think Isilon should be the storage substrate for the Enterprise Data Lake.
The Internet of Things and all the associated MGD is out there and smart enterprises are finding ways to derive value from it. With the right Data Lake and brilliant technologies like Splunk in place, smart enterprises can enable their people and processes to make data-driven decisions to out-innovate, out-sell, and out-smart their competition. That is really sexy and totally attainable. Isilon is fundamental to a successful Data Lake strategy, and using Splunk to put data at our users’ fingertips will allow us to remain focused on transforming our business with data. Avoid the science projects…enjoy the bacon.
– all visuals including “Splunk” logo or content are pulled directly from presentation given at the .conf event in Las Vegas, October 2014.