The Need for Structure in Unstructured Data – A Call for Lambda Architecture

There’s been a lot of buzz in the tech world lately about analytics. Plenty of people are encouraging you to take a look at your data and build a “Big Data” solution to derive value from that data.  So naturally, you go out and deploy a Hadoop cluster and BOOM, you have an analytics environment. You are “Big Data Ready”. And the truth is you are ready!  Ok, sorta…
The extremely interesting and somewhat complicated part of analytics and big data is that there is no “one size fits all” tool or product for your overall analytics environment. Much like your traditional IT environment, it involves multiple hardware, software, and application teams working together to stand up the overall solution. So while Hadoop is great at batch processing and chewing through large data sets, it is going to underwhelm you for real-time analysis.


I don’t want this to sound like I’m bashing Hadoop at all! I just want to make sure that we use the proper tools in their correct roles. Simply put, you wouldn’t want a Ferrari to pull a camping trailer, much like you wouldn’t want to go street racing in your F-150. Fortunately, a large number of open source projects have hit the streets that really excel where Hadoop doesn’t. The cool part is that many of them are well integrated into the Hadoop ecosystem of tools and close the gaps in Hadoop around streaming, data ingestion, governance, visualization, and much more! Now comes the fun part: trying to make sense of what each tool does well, where capabilities overlap, and how to leverage the ecosystem effectively. But don’t let that scare you off. There’s a great blueprint to follow that has really helped me gain insight into this crazy world of analytics and Big Data.
Enter “The Lambda Architecture”…
Nathan Marz, once a software engineer at Twitter, was faced with the task of building a solution to optimize his data processing environments. This solution, which he named “Lambda Architecture”, is a fault-tolerant, scale-out system that achieves low-latency reads and writes across a vast range of workloads and use cases. Basically, he created a tiering platform for your intensive data processing environment. While it sounds extremely complicated, let’s take a look at the diagram below to see how this works:
  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
  3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  5. Any incoming query can be answered by merging results from batch views and real-time views.
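The five steps above can be sketched in a few lines of Python. This is a toy, single-process illustration and not how any real deployment works: the class name `LambdaSketch`, its methods, and the hashtag-counting workload are all made up for this example, and a plain in-memory `Counter` stands in for what would really be separate batch, serving, and streaming systems.

```python
from collections import Counter


class LambdaSketch:
    """Toy, in-memory illustration of the three layers (hypothetical names)."""

    def __init__(self):
        self.master_dataset = []         # batch layer: append-only raw events
        self.batch_view = Counter()      # serving layer: pre-computed counts
        self.realtime_view = Counter()   # speed layer: counts for recent data only

    def ingest(self, hashtag):
        # Step 1: every event is dispatched to both the batch and speed layers.
        self.master_dataset.append(hashtag)
        self.realtime_view[hashtag] += 1

    def run_batch(self):
        # Steps 2 and 3: recompute the batch view from the full master dataset.
        self.batch_view = Counter(self.master_dataset)
        # Step 4: the speed layer only has to cover data the batch view has
        # not absorbed yet. Resetting it here is safe only because this toy
        # is single-threaded and nothing arrives while the batch runs.
        self.realtime_view.clear()

    def query(self, hashtag):
        # Step 5: answer by merging the batch view and the real-time view.
        return self.batch_view[hashtag] + self.realtime_view[hashtag]


system = LambdaSketch()
for tag in ["#bigdata", "#hadoop", "#bigdata"]:
    system.ingest(tag)

system.run_batch()           # batch view now covers the first three events
system.ingest("#bigdata")    # arrives after the batch run, speed layer only
print(system.query("#bigdata"))
```

The point of the sketch is the split of responsibilities: the batch side can be slow because it recomputes from the raw data, while the speed side stays small because it only covers whatever the last batch run missed.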
While I hope this diagram helped give you an idea of how Lambda Architecture functions, let’s put some of the various Apache and commercial projects on there to help give you an idea of how this all ties together:
[Figure: Lambda Architecture diagram with project logos]
I like the Lambda Architecture and the simple framework that it brings to the table: tiering in your analytics environment. It helps place each workload within the appropriate framework and gives us a simple way to think about this vast and growing set of tools in the Big Data ecosystem. For organizations looking to implement a Big Data analytics strategy, this architecture may help guide setting appropriate SLAs for various workflows and help improve the ROI of these sorts of projects by clearly communicating each tool’s use and capabilities, driving more robust adoption and leverage.
To put this into a real scenario, check out this post on using the Lambda Architecture for real-time analysis of Twitter hashtags.
Now get out there and build your very own Lambda Architecture…I’d love to hear how it goes!
Your Big Data Hipster Friend,
Kyle Prins